PREFACE



Selected papers presented at the 22nd Annual Conference of the German
Classification Society GfKl (Gesellschaft für Klassifikation), held at the
University of Dresden in 1998, are contained in this volume of “Studies in
Classification, Data Analysis, and Knowledge Organization”.

One aim of GfKl was to provide a platform for a discussion of results con- 
cerning a challenge of growing importance that could be labeled as “Classi- 
fication in the Information Age” and to support interdisciplinary activities 
from research and applications that incorporate directions of this kind. 

As could be expected, the largest share of papers is closely related to
classification and, in the broadest sense, data analysis and statistics.
In addition to contributions dealing with questions arising from the usage
of new media and the Internet, applications in, e.g., (in alphabetical
order) archeology, bioinformatics, economics, environment, and health have
been reported. As always, an unambiguous assignment of results to single
topics is sometimes difficult. From more than 130 presentations offered
within the scientific program, 65 papers are grouped into the following
chapters and subchapters:

• Plenary and Semi Plenary Presentations 

- Classification and Information 

- Finance and Risk

• Classification and Related Aspects of Data Analysis and Learning

- Classification, Data Analysis, and Statistics

- Conceptual Analysis and Learning

• Usage of New Media and the Internet

- Information Systems, Multimedia, and WWW

- Navigation and Classification on the Internet and Virtual Universities

• Applications in Economics 

- Finance, Capital Markets, and Risk Management 

- Marketing and Market Research 

• Applications in Archeology, Bioinformatics, Environment, and Health 

Within the (sub)chapters the presentations are listed in alphabetical order
of the authors’ names, with the sole exception that papers by authors of
plenary talks are placed before semi plenary contributions. At the end of
this volume, a list of key words is included as an additional aid to the
interested reader.

Last but not least, we would like to thank all participants of the conference
for their contributions, which have again made the annual GfKl meeting and
this volume an interdisciplinary forum for scientific discussion, and in
particular all authors and all colleagues who reviewed papers, chaired
sessions, or were otherwise involved.

In this context, we gratefully take the opportunity to acknowledge support 
by 

• Deutsche Forschungsgemeinschaft (DFG) 

• Fakultät für Wirtschaftswissenschaften der Technischen Universität
Dresden

• Sächsisches Ministerium für Wissenschaft und Kunst
and 

• Bayerische Vereinsbank AG 

• Bayerische Hypotheken- und Wechselbank AG Filiale Dresden 

• Commerzbank AG Filiale Dresden 

• Deutsche Telekom AG Niederlassung II Dresden 

• Nomina GmbH 

• Siemens AG 

• SRS Software und Systemhaus Dresden GmbH 

The volume was put together at the University of Karlsruhe, and here, at
least, Michael Löffler has to be mentioned, who helped in a way that
outsiders cannot appreciate.

Finally, we thank Springer-Verlag, Heidelberg, for excellent cooperation in
publishing this volume.

Wolfgang Gaul

PLENARY AND SEMI PLENARY PRESENTATIONS



Classification and Information 




Scientific Information Systems and Metadata 

M. Grötschel, J. Lügger

Konrad-Zuse-Zentrum für Informationstechnik Berlin (ZIB)

Takustr. 7, 14195 Berlin-Dahlem, Germany 

Abstract: This article begins with a short survey of the history of the
classification of knowledge. It briefly discusses the traditional means of
keeping track of scientific progress, i.e., collecting, classifying,
abstracting, and reviewing all publications in a field. The focus of the
article, however, is on modern electronic information and communication
systems that try to provide high-quality information by automatic document
retrieval or by using metadata, a new tool to guide search engines. We
report, in particular, on efforts of this type made jointly by a number of
German scientific societies. A full version of this paper, including all
hypertext references, links to online papers, and references to the
literature, can be found under the URL: http://elib.zib.de/math.org.softinf.pub



Introduction 

The exponential growth of information, in particular in the sciences, is a
topic discussed broadly. The problems arising from this increase are treated
in depth in Odlyzko (1995). This development calls for adequate organization
of knowledge and for efficient means of retrieving information. The
information flood is fostered by the tools provided by the Internet. At the
same time, these technological developments seem to make efficient
information handling feasible. This will be discussed in this paper.

Before we do that, let us see how previous generations have coped with the
problem of making knowledge accessible. The most prominent approach
was “classification”. The uninitiated observer may believe that
classification is nothing but a way to organize knowledge so that relevant
information can be found easily. The claim “Classification is power” may
thus come as a surprise. However, this is nothing but our concise synopsis
of the article by Darnton (1998). It also follows from a combination of John
Dewey’s claim “Knowledge is classification” and Francis Bacon’s “Knowledge
is power”. Let us explain these statements. Whenever organized information
is offered, an explicit or implicit classification is used. On the Internet,
organized information is provided through, e.g., a universal virtual library
such as Yahoo!, a subject-specific list of links like Math-Net-Links, or a
search engine such as GERHARD (see Koch et al. (1997) for URLs and more
detailed information about the role of classification schemes in the
Internet). Why is such a classification an exercise of power? The key
observation is that the classifier decides which topic is important (high on
the list, or on the list at all); the search engine designer does the same
through the rules of his ranking algorithm. He may manipulate the world by
listing first the information he likes best (or for which he gets paid), in
the same way as an encyclopedist directs the attention of a reader along the
lines and branches of his design of the “tree of knowledge”. Darnton (1998)
outlines this aspect with respect to Diderot’s Encyclopédie.

D’Alembert (1751/1998) discusses Diderot’s plans to design a world map of
knowledge that can be navigated easily. He also justifies their joint decision
to abandon this plan because one could “think of as many scientific systems
as world maps of different views, whereby each of these systems has a
specific exclusive advantage in favor of the others”. They observed that all
arrangements of knowledge are arbitrary, and that each has a great number
of inherent defects and unsolvable contradictions, in particular, if the
evolution of knowledge over time is taken into account. They therefore
decided on alphabetic ordering, nowadays called a lexicographic index, which
is the traditional way of “implementing” information retrieval. In fact,
250 years later the developers of AltaVista, one of the most successful
Internet search engines, were confronted with exactly the same problem,
namely “to index or to classify”, see Seltzer et al. (1997). They decided
in favor of indexing and information retrieval and against a hierarchic
classification of the kind employed, e.g., by Yahoo!.

The strengths and weaknesses of classification and information retrieval are
widely discussed, in particular by those who dream of a “universal,
heterogeneous, world-wide digital library”. It follows from d’Alembert’s
observation that neither a single universal classification scheme nor an
information retrieval mechanism alone suffices to create such a library.
The new idea is to furnish data with additional data about these data,
called metadata, that allow one to view the world from different
perspectives.

Although the initial goal of the “metadata move” was to help the authors
of web resources to make the results of their work more visible, the
current development aims at more general goals. Both the attributes and the
contents of metadata are still in the design process. The initiative is
fostered not only by the providers of digital libraries. It gained momentum
through broad acceptance within the group of the more traditional content
providers (publishers, museums, libraries, archives, document delivery
services, etc.). Thus, it now also aims at providing suitable metadata for
all kinds of traditional documents and for the more complex digital items in
the web (which include not only books and papers, but also videos, music,
multimedia information, and hypertexts), now termed document-like objects.

Before stepping into the details of metadata, let us briefly review some of
the developments relevant to the topic that took place before the birth of
the web.



1 Classification in the Sciences 

Metadata were invented to form a basis for navigation in data sets
of large scale. They were not conceived in the context of classification.
Classification systems have been designed for structuring the body of
knowledge so that new information can be incorporated easily and already
archived information can be found effectively. Although metadata and
classification systems aim at completely different targets, there is an
intimate relationship. We want to explore this briefly. We start with a
short sketch of the history of classification in the sciences.

Classification is mainly motivated by two objectives: 

• to introduce structure into masses of facts, 

• to build a “unified and homogeneous” view of the “world” . 

History bears witness to some of the potential of classification.

Linné’s classification of plants and animals started biology as a science in
the 18th century, see Rossi (1997). Linné’s system is still in use today. In
fact, Linné can be viewed as the father of modern taxonomy. To name
another example, modern chemistry started as a science with the periodic
system, which was “found” in the middle of the 19th century after a number
of false starts, see Bensaude-Vincent (1989).

Designers of classification systems always tried to follow three principles.
Their system should be organic, simple to grasp, and simple to memorize.
By this design, such systems can be viewed as communication tools. In fact,
the “librarian” Melvil Dewey used numbers to name the classes in his system
and, thus, envisioned his Decimal Classification (DDC) as a
language-independent universal communication system, see, e.g., Dahlberg
(1974).

The early designers of classification systems in the 18th century also con- 
ceived knowledge as a new territory to be discovered. They wanted to pro- 
vide tools for navigation in this unknown landscape. (Notice the similarity 
to “navigation concepts” in the modern World Wide Web.) 

Darnton (1998) states that Mappemonde was a metaphor central to Diderot
and many other encyclopedists. Diderot viewed his Encyclopédie as a world
map showing the connections and interdependencies among the most
important countries.

An elementary feature of a classification system is that it draws borders. It 
must do this in order to distinguish objects. But borders, wherever they are, 
are dangerous. They are subject to attacks. Here attacks come from different 
views or from new knowledge. If too many of the important borders go, a 
system disintegrates. This danger causes fear and, in turn, rigidity. Lynn 
Margulis, for instance, describes this phenomenon. She created a new theory 
of the origins and evolution of cells, summarized in her book Symbiosis in 
Cell Evolution. Her analysis has a strong impact on biological taxonomy and
systematics. In her paper (Margulis (1995)) she describes the rigidity of the
establishment, which was very reluctant to accept the new view. Among many
other obstacles she lists: “. . . a school or publisher would have to change
its catalog. A supplier has to relabel all its drawers and cabinets.
Departments must reorganize their budget items, and NASA, . . . , and various
museums have to change staff titles and program-planning committees. The
change . . . has such a profound implication . . . that resistance to accept
it abounds . . . . It is far easier to stay with obsolete intellectual
categories.”

Margulis’ remarks apply in general. Scientific progress is a danger for every
classification system.

Thus, whatever genius is used for their design, classification systems are
limited in range and in time, often inconsistent and illogical and, partly,
even contradictory or paradoxical. This holds, in particular, if ad-hoc
adaptations are made to incorporate new developments. This is one of the
reasons for Daston’s statement: “Classifications organize, but they are not
organic”, Daston (1997).

For instance, biology before Darwin was organized according to the “rule
of five”, which was supported by many leading scientists of that time, see
Gould (1985). It is almost unimaginable that a reader of our time could
“believe” in such a system. It took a revolution to change this view.

Why have we told these stories about traditional classification systems and
their disadvantages? Well, because one can observe that history seems to
repeat itself in the world of electronic information. With the advent of
powerful computers, cheap storage devices, and fast networks (in short: with
the rise of the Internet) a flood of electronic information appeared.
Catchwords such as “information society” were quickly coined, pointing at
the fact that in the electronic world digital information is accessible from
everywhere, by everybody, at any time. However, it was soon realized that
Internet information is “chaotic”. To cure this disease, classification
systems came up quickly.
According to Koch (1997) there are a number of classification mechanisms
operating in the World Wide Web. They provide

• support for navigation by

- structuring for browsing (e.g., WWW Virtual Library),

- setting context for searches (Scorpion),

- broadening/narrowing searches (Yahoo!),

- helping to master large databases (MSC index),

• support for communication by

- organizing large sets of electronic discussions (UseNet News),

- presenting/accessing knowledge in a common (uniform) way,

- allowing for interoperability of databases (“crosswalks”),

- stabilizing contexts and conceptual schemes for distributed user
communities in networks.

In the electronic world classification systems suffer from the same deficiencies 
as they do in the traditional world. 

In fact, some of these deficiencies become even more visible. It is generally 
agreed that, fostered by the electronic revolution, progress in the sciences 
and the production of information is becoming faster and faster. Rapid 
changes make the rigidity, fixed granularity, and slow adaptivity of classifi- 
cation systems very apparent. 

On the other hand, the electronic world offers new opportunities that
significantly enlarge the power of traditional navigation provided by
classification systems. For instance, modern hypertext systems in the World
Wide Web offer graphical and spatial navigation by means of

• interactive maps without any limit on the depth of nesting (e.g.,
Virtual Tourist, CityNet),

• combinations of pictures from the earth (or even space) with geospatial
coordinates (EarthView, Living Earth),

• a list of icons to select from collections of video clips (Cine Base Video
Server),

• two-dimensionally arranged collections of icons representing maps that
produce, when selected, regularly updated geospatial information such
as weather forecasts, temperature maps, etc. (Blue-Skies Weather
Maps).

An ever growing spectrum of alternative navigational paradigms is coming
into common use today, such as

• navigation by historic terms, chronologies, or history maps (History of
Mathematics from MacTutor, Chronology of Mathematicians),

• navigation by theory, e.g., by mathematical expressions (Famous Curves
Index from MacTutor).

We expect that the full spectrum of document/resource description and
related navigational facilities - as they are in use already in modern
hypertext systems, like Hyperwave, see Maurer (1996) - will come to the Web
with the new Extensible Markup Language (XML), which is based on SGML and
supported by the World Wide Web Consortium (W3C), see Mace et al. (1998).

2 To Index or to Classify 

One of the severe drawbacks of classification systems is that they have to be 
supported by manpower. A group of persons must agree on the interpreta- 
tion of the items of the scheme. Information must then be processed by a 
person who evaluates the contents, selects key words, and assigns the objects 
to appropriate classes of the scheme. Thus, such systems have limits due to 
the availability of time, manpower, and financial means. In general, classi- 
fication systems cannot keep up with the growth of knowledge; in modern 
terms, they don’t scale. 

In fact, d’Alembert not only observed this phenomenon, he also argued, as
mentioned before, that the adaptation of a classification system is outpaced
by the growth of information. Therefore, d’Alembert and Diderot decided
against classification and arranged their Encyclopédie lexicographically,
i.e., they decided to index and not to classify.

Those groups of people aiming at making web information more accessible
were confronted with the same problems as d’Alembert, however, at a much
larger scale.

The first efforts to organize web information were based on alphabetic lists
of web resources; the most prominent example is the WWW Virtual Library.
Although this is a very valuable resource locator and although its
maintenance is distributed over many shoulders, it cannot keep up with the
growth of the web. In fact, similar endeavours, such as the University of
Geneva’s GUI catalogue, stopped their service. Considering these
difficulties, it is apparent that one has to look for “automatic solutions”.
Considerations of this kind gave birth to “search engines”. One prominent
example, among many others, is AltaVista. The goal of the designers of
AltaVista was to index the contents of the whole (accessible) web. The only
way to establish and maintain such an index at acceptable cost is to use
so-called “robots”, i.e., programs that traverse all available web resources
(by following all the links they can find), extract, index, and rank all
“relevant” information they can find, and concentrate the results into one
huge database served by very powerful computers. At the beginning of these
projects it was quite unclear whether search engines would be able to
achieve their ambitious goals. Today they have become a tremendous success.
Basically everybody using the Internet employs search engines to find
information.
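
The working principle of such a robot can be illustrated by a minimal sketch. It is not the design of any actual search engine: the miniature "web" `PAGES`, the function names, and the omission of any ranking step are assumptions made purely for illustration. The robot follows links breadth-first from a start page and builds an inverted index (word to set of pages).

```python
from collections import defaultdict, deque
import re

# Hypothetical miniature "web": URL -> (page text, outgoing links).
PAGES = {
    "a.html": ("classification of knowledge", ["b.html", "c.html"]),
    "b.html": ("metadata guide search engines", ["c.html"]),
    "c.html": ("search engines index knowledge", ["a.html"]),
    "d.html": ("unreachable page", []),  # no page links to it
}

def crawl_and_index(start, pages):
    """Follow links breadth-first from `start`; return word -> set of URLs."""
    index = defaultdict(set)
    seen = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        text, links = pages[url]
        # "Extract and index": record every word occurring on the page.
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(url)
        # "Traverse": schedule every linked page not visited so far.
        for link in links:
            if link in pages and link not in seen:
                seen.add(link)
                queue.append(link)
    return index

def search(index, query):
    """Return the sorted URLs of pages containing every query word."""
    words = query.lower().split()
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)

index = crawl_and_index("a.html", PAGES)
print(search(index, "search engines"))
```

Note that the page `d.html` never enters the index because no link leads to it; a robot of this kind can only cover the part of the web that is reachable by links.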

A different path was taken by the designers of the (now) commercially very
successful Yahoo!. Yahoo! has developed its own classification system (they
call it ontology) and employs a group of classifiers (currently about 20) who
select information resources that they consider valuable for the “Yahoo!
customers”. These resources are classified (the staff writes short
descriptions) and integrated into the Yahoo! scheme. Whatever is contained
in the Yahoo! system can be retrieved using a specific search engine. Thus,
Yahoo! is a combination of a standard (but new) classification system that
is based on manual work (evaluating/ranking by a group of experienced
classifiers) with modern electronic retrieval tools. The limitations on
manpower force concentration on special topics and strict selection. What
may seem a weakness has turned into a strength, since the customer of Yahoo!
is sure to obtain information assessed by competent persons.

The obvious question is: Can’t one replace the experienced classifiers by
automatic evaluation systems? This question was asked more than thirty years
ago and gave rise to the theory of information retrieval. Among
the key terms in this theory are “precision” and “recall”. (Precision is
the number of retrieved and relevant items divided by the number of all
retrieved items. Recall is the number of retrieved and relevant items divided
by the number of all relevant items.) It seems that these terms provide good
tools to describe the quality of the answers an information retrieval system
gives. However, both definitions contain the term “relevant”, which is not a
technical term but a “concept of mind”. Confronted with a larger collection
of, e.g., scientific papers, even a specialist does not know which papers are
“relevant” for him. How should a machine do so? Thus, there is little
chance of getting help from computers in analyzing “relevance” in document
collections of significant size. Furthermore, what is relevant may change over
time or after having obtained new information.
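
The two definitions can be made concrete with a short sketch. The document identifiers and the function name below are hypothetical; the crucial assumption is that the set of relevant items is simply given, which is exactly what is hard to obtain in practice.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) for a retrieval result.

    `relevant` is assumed to be known in advance; in practice this
    judgment is the difficult, non-technical part.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # items that are retrieved AND relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 4 items retrieved, 5 items relevant, 2 in common.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d5", "d6", "d7"]
p, r = precision_recall(retrieved, relevant)
print(p, r)  # 0.5 0.4
```

Here half of the answers are relevant (precision 2/4), but only two of the five relevant items were found (recall 2/5).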

Blair (1990) discusses in depth why relevance is difficult to ascertain, and
why measuring the success of retrieval results is difficult and very costly.
He argues that this is almost impossible for large databases. Research in
information retrieval stalled about fifteen years ago. However, interest in
information retrieval techniques is rising again. This rise is fostered not
only by the appearance of search engines and the growth of the Internet,
but also by the world of “large-scale research”. For instance, in the Human
Genome Project, massive sets of data are produced (by making automated
experiments and automatically measuring the results) that must be
recorded, linked to, and combined with other data. These data must be
classified along the prevailing theories and connected to associated
publications. Furthermore, statistics, visualizations, etc. have to be
produced.

While traditional classification and information retrieval systems are based
on a linguistic approach (organizing knowledge, uniform view, universal
requirements, and document retrieval), the new demands focus on pragmatic
topics (organizing documents, user-specific needs, adaptable views, and data
retrieval). In fact, this very same change of view was the impetus for the
digital library projects in the United States and the United Kingdom.

The concept of metadata that we will discuss in the next section arose within 
these digital library projects. 



3 Towards Resource Discovery in Networks 

With the advent of the Datenautobahn, the Internet and its large and
widespread digital resources, we are confronted with yet another order of
complexity. The big Internet archives not only contain text material (like
preprints and electronic books), they also include images, maps and
geospatial data, videos and computer vision material, environmental and
agricultural databases, vast arrays of governmental and statistical data,
pictures from the universe, etc.

Right here, on the information highway, library science, communication,
and computing are merging. Thus, it is no wonder that the origins of
metadata are rooted in the digital library projects supported by NSF, NASA,
and ARPA. Nevertheless, it was a well-known document delivery service,
OCLC, and a supercomputing center, NCSA, that started the first concrete
“universal” metadata activity. They conceived the Dublin Core; see Weibel
(1995) for a report on the first workshop in Dublin, Ohio, where the term
Dublin Core was coined. Later, UKOLN, the UK Office for Library and
Information Networking, and many other working groups, user communities,
and organizations joined the project, e.g., national libraries, museums, and
institutions of cultural heritage.

A new (very pragmatic) paradigm came up with this movement: usability
and utilization, instead of knowledge ordering and information retrieval. In
short: the focus is on data - not on knowledge and not on information.
And there are “data about data”, from now on called metadata, conceived as
“information that makes data useful”. This concept is centered around the
user and his needs. There is another shift. The user is not only a consumer
who wants to discover resources in the Internet; the user also offers his
own resources and is asked to do so.

3.1 Dublin Core, Issues and Problems 

The Dublin Core is still in active development. In the beginning (spring
1995), as Weibel et al. formulated in the OCLC/NCSA Metadata Workshop
Report, “The discussion was . . . restricted to the metadata elements for the
discovery of what we called document like objects, or DLO’s by the workshop
participants”. The Internet was considered chaotic, and the proposed solution
was to provide authors of Internet resources with metadata techniques from
the library and information sciences. These techniques should, however, be
easier to use. The initial aims of the designers were very ambitious. The
Dublin Core elements should guarantee:

• Intrinsicality, 

• Extensibility, 

• Syntax independence, 

• Optionality, 

• Repeatability, 

• Modifiability. 

Content                    Intellectual Property      Instantiation

 1. Title                   2. Author or Creator       7. Date
 3. Subject and Keywords    5. Publisher               8. Resource Type
 4. Description             6. Other Contributor       9. Format
11. Source                 15. Rights Management      10. Resource Identifier
12. Language
13. Relation
14. Coverage

Table 1: The Dublin Core Metadata Element Set, according to DC-5

At that time (and still today) a great number of different description schemes
for resources were in use in different user communities, see Dempsey and
Heery et al. (1997) and Heery (1996) for an overview of present metadata
formats. The question was, do these have something in common that would
help the authors of Web resources? The answer given at the first Dublin
Core Workshop (we will use the abbreviations DC and DC-1) was a core
set of twelve elements, the elements numbered 1, . . . , 12 in Table 1. DC-3
added three more elements. All these were grouped and finally named (the
current usage is indicated in bold) at DC-5 in the way shown in Table 1
(http://purl.org/metadata/dublin_core_elements).

In the meantime five major Dublin Core workshops have taken place, the last
one, DC-5, in October 1997 in Helsinki. There was a shift in major principles
as well as in the constitution of the DC community, both due to wide
international discussion and broad acceptance of the general idea. According
to the DC-5 Report given by Weibel and Hakala (1998), the design principles
are now

• Simplicity, 

• Semantic Interoperability,

• International Consensus, 

• Flexibility. 

The center of gravity in the activities of the DC community has changed from
authors (laymen) to catalogers (information professionals). As Priscilla
Caplan (1997) reports in the PACS Review: “Back in 1995 we focused on
providing authors with the ability to supply metadata as they mounted their
own publications to the Web. This is happening, but not as much as we
expected; most metadata is being created by catalogers, or information
professionals we wouldn’t quite call catalogers, or by other non-authorial
agents.”

Today, technical rather than conceptual questions have moved into the focus
of the discussions, e.g., integration of heterogeneous databases,
interoperability of digital libraries, combinations of digital resources,
and wide accessibility of catalog information. The DC community encompasses
a broad spectrum of groups from libraries, museums, archives, and
documentation centers, both public and commercial, and also a number of
scientific groups, e.g., from mathematics, astronomy, geology, and ecology.
These groups have agreed on extending and applying the DC metadata
principles to non-textual objects (e.g., scanned images, digitized music,
and video clips) and also to non-Web objects such as entries of library
OPACs and catalog information on visual arts and historic artefacts, which
cannot be scanned at all.

In spite of these substantial extensions in aims and targets the Dublin Core 
remained simple. It is a conceptual scheme which - free from the peculiarities 
of syntax and implementation - can be described by no more than three 
typewritten pages. The DC community strongly supports implementation 
projects in the World Wide Web (based on HTML) and in the Z39.50- 
oriented database community. The Helsinki Workshop web pages list about 
30 major implementation projects (http://linnea.helsinki.fi/meta/), 
e.g., the Math-Net project of mathematics in Germany. 
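
To give an impression of what an HTML-based implementation may look like, the following sketch renders a DC record as HTML `<meta>` tags. The naming pattern `DC.<element>` is one common convention and is used here as an assumption; the function name and the choice of elements are likewise illustrative and do not reproduce the conventions of any of the projects mentioned. The sample values are taken from this article.

```python
from html import escape

def dc_meta_tags(record):
    """Render a DC record (element -> value or list of values) as meta tags.

    Elements missing from `record` are simply omitted (optionality);
    a list of values yields one tag per value (repeatability).
    """
    lines = []
    for element, values in record.items():
        if isinstance(values, str):
            values = [values]
        for value in values:
            lines.append(
                '<meta name="DC.%s" content="%s">'
                % (escape(element, quote=True), escape(value, quote=True))
            )
    return "\n".join(lines)

# Illustrative record describing this article.
record = {
    "title": "Scientific Information Systems and Metadata",
    "creator": ["M. Grötschel", "J. Lügger"],
    "identifier": "http://elib.zib.de/math.org.softinf.pub",
}
print(dc_meta_tags(record))
```

The output consists of four tags, one for the title, two for the creators, and one for the identifier, which an author could paste into the head of a web page.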

The DC community gained momentum through its ability to integrate a
variety of scientists and cataloging people, who are creating their own
metadata methodologies according to their special habits, needs, and uses,
and who are increasingly realizing that information exchange between the
sciences and society is becoming more and more essential. For them, the
Dublin Core provides a “window to the world”.

This picture describes the situation quite precisely. One can view the world 
of information as a set of many rooms containing massive sets of hetero- 
geneous data, in general not accessible by inhabitants of other rooms. DC 
metadata provide a uniform interface through which search engines, robots, 
etc. can collect information about the data in other rooms. The search 
engines etc. obtain the ability to gather and process the attributes and 
allow the viewer to inspect them in an integrated environment. This sup- 
ports inter- and transdisciplinary information exchange far beyond what 
traditional libraries can offer. This is in line with the growing trend in the 
sciences to present results to the general public. 
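
The “uniform interface” can be sketched from the point of view of a robot: whatever a page contains otherwise, the robot only needs to read its DC attributes. The sketch below assumes that the metadata are embedded as HTML `<meta>` tags whose names start with `DC.`; this encoding, the class name, and the sample page are assumptions for illustration only.

```python
from html.parser import HTMLParser

class DCHarvester(HTMLParser):
    """Collect the DC attributes of one HTML page into a dictionary."""

    def __init__(self):
        super().__init__()
        self.record = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = attrs.get("name") or ""
        content = attrs.get("content")
        # Only names of the assumed form "DC.<element>" are harvested.
        if name.lower().startswith("dc.") and content is not None:
            element = name[3:].lower()
            self.record.setdefault(element, []).append(content)

def harvest(html_text):
    """Return element -> list of values for all DC meta tags in the page."""
    parser = DCHarvester()
    parser.feed(html_text)
    parser.close()
    return parser.record

# Hypothetical page from one of the "rooms".
page = (
    '<html><head><title>ignored</title>'
    '<meta name="DC.title" content="Maps &amp; Metadata">'
    '<meta name="DC.creator" content="A. Author">'
    '<meta name="DC.Creator" content="B. Author">'
    '<meta name="keywords" content="not part of the DC record">'
    '</head><body>arbitrary, room-specific content</body></html>'
)
print(harvest(page))
```

Records harvested in this way from very different sources share the same small set of attribute names and can therefore be merged and searched in one integrated environment.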

3.2 What the Dublin Core Cannot Do, and Ways Out 

Bibliographic cataloging, in a few words, consists of a set of rules by which 
information about a book can be reduced to a catalogue card in a systematic 
way. To make this work, throughout the world, rule systems, such as RAK 
in Germany or AACR in the United States, have been designed that provide 
very good guidelines for the professional cataloger but are far too complex 
for the “educated layman”. One of the ideas behind the Dublin Core was to
extract the “best of this world” so that authors of web objects can describe
their products without professional aid. The products for which the Dublin
Core has been designed are what was called “document-like objects”, which
may be anything that is stored electronically in the web, e.g., electronic
versions of books and journals, digital maps, sources of programs, geospatial
or medical data, etc.

Of course, the communities working with computer programs, medical data,
etc. have also developed their own description schemes, and they use
differently constructed databases. It was, thus, another aim to formulate
the DC concept in such a way that the central attributes of these
descriptions can also be included. Interoperability of the respective
technical systems was a main goal.

If you are determined to stay “simple and universal”, as the Dublin Core 
does, you cannot describe everything. Moreover, in the implementation 
process there is no way to escape from specifying details. This was also 
apparent to the DC designers. At the second workshop, DC-2 in Warwick, 
UK, a conceptual framework, called the Warwick container architecture, 
was created, which allows the Dublin Core to be supported and enriched by sets of
additional description elements; see Weibel and Hakala (1998) for a review
of DC-1 to DC-5. A detailed development of the Warwick framework has,
however, not happened yet. The group working on this issue has joined
forces with another group working on a similar topic to develop the
Resource Description Framework (RDF) for the World Wide Web. RDF
is based on XML (Extensible Markup Language), which is viewed by almost
everyone involved as one of the future web languages. The DC community
has also taken other steps to integrate large potential user groups.

A surprising result of DC-3, the CNI/OCLC Image metadata workshop, 
which took place in Dublin, Ohio, in September 1996, was that now non- 
Web objects can also be treated adequately within the Dublin Core. As a 
consequence, the DC community receives useful support and criticism also 
from the (traditional) cataloging community. This would not have happened 
without the inclusion of the 15th element, Rights Management, because
visual art and digitized images are often affected by copyright regulations,
as are databases with specialized information.

The DC-3 workshop also made clearly visible the limitations that result from
restricting the DC concept to only 15 attributes. This led to some tensions
within the DC community, which were partially resolved at DC-4 in Canberra,
Australia, in March 1997. It was accepted as a solution that DC
metadata should be enhanced by (at least) three qualifiers in order to gain
more expressive power. The so-called three Canberra Qualifiers are: LANG (to
characterize the language a specific metadata element is written in), TYPE,
in the meantime called SUBELEMENT (to specify subfields for greater
precision), and SCHEME (to specify a bibliographic scheme or international
standard used). Each of these qualifiers is under development in different
DC working groups.

To give some examples, we will now discuss a few problems (out of a broad 
spectrum) that became visible through the experience of a number of im- 
plementation projects; for details see the extensive discussions in the DC 
meta2 mailing list (meta2@mmrl.ut.ac.uk). 

You will need the LANG qualifier in order to write the title element

DC.Title = (LANG = ...) ... text ...

of a resource whenever the text is not written in English (which is
the default).

And you will need a subelement, e.g. 

DC.Title.Alternative = ... text ...



for any title other than the main title, where “title” is the name of the 
resource, usually given by the creator or publisher. 

The Creator (or Author) element of a resource is packed with problems once 
you start to think about it. You need (the help of) a scheme to write (and 
search for) it correctly. How would you code the name of the author? Which 
one of the following alternatives would be correct? 

DC.Creator = Grotschel, Prof. Dr. M.

DC.Creator = Prof. Dr. Martin Grotschel

DC.Creator = Martin Groetschel

DC.Creator = Grötschel, M., Prof.

You must write a name “correctly” if you want to have reasonable alphabetic
lists, for instance. You will also need a coding convention for accents,
umlauts, etc., e.g., to sort names consistently. Professional catalogers and
librarians use name authority files, such as the LCNAF (Library of
Congress Name Authority File) from the LOC; in Germany one would use
PND, the PersonenNamenDatei, or GKD, the Gemeinsame Körperschaftsdatei.
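
One possible coding convention for sorting can be sketched in Python. This is only an illustration under our own assumptions: the function name `sort_key` and the sample names other than the author's are hypothetical, and stripping diacritics is just one convention among several (others file ö as oe, for instance). It shows why a raw character-code ordering is not sufficient.

```python
import unicodedata


def sort_key(name: str) -> str:
    """Hypothetical filing key: drop diacritics and ignore case.

    NFKD splits a letter such as o-umlaut into the base letter plus a
    combining mark; the combining marks are then discarded.
    """
    decomposed = unicodedata.normalize("NFKD", name)
    base_letters = "".join(ch for ch in decomposed
                           if not unicodedata.combining(ch))
    return base_letters.casefold()


# "\u00f6" is the precomposed o-umlaut; the other names are made up.
names = ["Zorn, M.", "Gruber, P.", "Gr\u00f6tschel, M.", "Abel, N."]

raw_order = sorted(names)                   # plain code-point ordering
filed_order = sorted(names, key=sort_key)   # ordering by the filing key
```

With the plain ordering the umlaut form is filed after "Gruber", because the code point of ö lies beyond all unaccented letters; with the filing key it is placed where a reader of an alphabetic list would look for it.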

Apart from LCNAF, PND, etc. there are many other kinds of “controlled
vocabularies”. For subjects, keywords, and classification codes, to name
just a few, there are LCSH, MeSH, AAT, LCC, DDC, UDC, BC, NLM, and
MSC.

You will also need subelements for proper discrimination in searches, e.g.,
to specify the creator with greater precision:

DC.Creator = ...

DC.Creator.PersonalName = ...

DC.Creator.CorporateName = ...

DC.Creator.PersonalName.Address = ...

DC.Creator.CorporateName.Address = ...
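
The qualified elements can be pictured as simple records. The following Python sketch is an illustration under our own assumptions: the class `DCStatement`, its field names, the helper `select`, and all sample values are hypothetical and not part of any Dublin Core specification. It merely shows how a dotted subelement path lets a search match DC.Creator broadly or DC.Creator.PersonalName narrowly.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DCStatement:
    """One metadata statement: element path, value, optional qualifiers."""
    element: str                  # e.g. "DC.Creator.PersonalName"
    value: str
    lang: Optional[str] = None    # LANG qualifier: language of the value
    scheme: Optional[str] = None  # SCHEME qualifier: e.g. an authority file


def select(statements: List[DCStatement], element: str) -> List[DCStatement]:
    """Return statements for `element` itself or any of its subelements."""
    prefix = element + "."
    return [s for s in statements
            if s.element == element or s.element.startswith(prefix)]


# Made-up record for illustration only.
record = [
    DCStatement("DC.Title", "Ein Beispieltitel", lang="de"),
    DCStatement("DC.Title.Alternative", "An example title"),
    DCStatement("DC.Creator.PersonalName", "Doe, Jane", scheme="PND"),
    DCStatement("DC.Creator.CorporateName", "Example Institute"),
]

all_creators = select(record, "DC.Creator")                # both creators
persons_only = select(record, "DC.Creator.PersonalName")   # personal names
```

Searching on the parent element returns every kind of creator, while searching on a subelement restricts the result, which is the discrimination the subelements are meant to provide.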

But who is the creator of a digitized painting by Picasso? Is it the person 
who digitized it (and put it in the Web) or is it Picasso? An answer to this 
question was given at DC-5 in Helsinki by means of the 



1:1 Principle: Each resource should have a distinct metadata description 
and each metadata description should include elements for a single 
resource. It is desirable to be able to link these descriptions in a 
coherent and consistent manner by usage of the RELATION element. 



The consequences of this decision are not yet fully understood. The relation 
field is under development and will go through some major evolution in the 
near future. At present about five major types of relations are discussed in 
the relation working group: 

1. Inclusion relation (e.g., collection, part of) 

2. Version relation (edition, draft) 

3. Mechanical relation (copy, mirror copy, format change) 

4. Reference relation (citation) 

5. Creative relation (translation, annotation) 

Even if all these problems connected with names are solved, there remains
what we call the “Tschebytscheff Problem”. As M. Hazewinkel (1998) pointed
out on the occasion of a metadata workshop in Osnabrück, Germany, there
are more than 600 variants of writing Tschebytscheff, the name of a famous
mathematician, correctly.

We agree with Mary Lynette Larsgaard (1997), a spatial-data cataloger at
the Map and Imagery Laboratory, Davidson Library, University of California,
Santa Barbara: “Full cataloging is a complex, time-consuming process.
Library administrators, when they feel like being horrified, figure out how
much time (and therefore money) it takes per title - around $67 per item,
at least at Davidson Library, ...” and “There are many more possible methods
of access where full cataloging is used; the question is, how necessary are
they? And the answer is, it depends. What are users looking for?”

But, summarizing her experience in cataloging images from the Web using
Dublin Core elements, she also states: “The general experience in university
libraries is that a brief record is sufficient, and indeed, this brief record is
what normally displays in a library online catalog. Only the place of
publication does not appear in the Dublin Core element set.”

3.3 Metadata and Classification 

Classification systems can be categorized, according to T. Koch (1997), into

• Universal schemes (e.g., LCC, DDC, UDC) 

• National general schemes (e.g., BC/PICA, RVK) 

• Subject-specific schemes (e.g., MSC, NLM) 

• Home grown schemes (e.g., Yahoo!’s anthology)

Universal schemes are ponderous, partly contradictory, and they are not well 
known to the scientist. Their advantage is their potential for “normalisa- 
tion”, e.g., by providing a framework for controlled keywords. In fact, this 
is their main use in universal libraries. Subject-specific schemes, in contrast,
are in frequent use within certain scientific communities, which often utilize
them for communication purposes. Subject-specific schemes, however, rarely
transcend the borders of the specific community. National general schemes,
on the other hand, are limited by their inherent range of acceptance. Home
grown schemes, finally, may be accessed and used worldwide but, unfortunately,
they are in general the result of the activity of a few persons or
enterprises. Such schemes often disappear as soon as their creator gives up
or lacks commercial success.

Suppose there were a universally accepted classification scheme and
vast sets of resources were perfectly classified according to this
scheme; then we would have reached the heaven of search and retrieval.
By dynamically adjusting the granularity of our search we could easily find
those documents that match our interest. The world has not reached this
state. We neither have a generally accepted classification scheme, nor is all
relevant information classified. This holds, in particular, for web documents.
We view metadata according to the Dublin Core scheme as a reasonable 
description of (web and non-web) documents and document-like objects. 
The Dublin Core elements constitute a conceptual description scheme that 
seems to form a good compromise between generality, precision, and sim- 
plicity. What is not so obvious is that it can also be used as a substitute for 
a universal classification scheme. In fact, special user groups can employ the 
Dublin Core to design and generate their own specific description systems. 
To achieve this goal we have to assume the existence of a search engine
that “understands Dublin Core”, i.e., is able to restrict its search to the
Dublin Core elements and allows searches to be targeted at words that are
used as significant terms. This would result in a considerable improvement
in precision without resorting to (enormously resource- and time-consuming)
full text search. It would be desirable for all search engines to offer this
option. At present, only a few experimental search engines of this type
exist.
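
The gain in precision from restricting a search to Dublin Core elements can be illustrated with a small Python sketch. It rests on our own assumptions: the function `fielded_search`, the dictionary layout with a `fulltext` key, and the two sample documents are hypothetical and do not describe any existing search engine.

```python
from typing import Dict, List, Sequence

# A document is modelled as a mapping from field name to text. The field
# names are Dublin Core elements plus a hypothetical "fulltext" field.
Document = Dict[str, str]


def fielded_search(docs: Sequence[Document], term: str,
                   fields: Sequence[str]) -> List[int]:
    """Return the indices of documents containing `term` in any given field."""
    needle = term.casefold()
    return [i for i, doc in enumerate(docs)
            if any(needle in doc.get(f, "").casefold() for f in fields)]


# Made-up documents for illustration only.
docs = [
    {"DC.Title": "Graph colouring heuristics",
     "DC.Subject": "graph theory",
     "fulltext": "We colour the vertices of a graph so that ..."},
    {"DC.Title": "Annual report",
     "DC.Subject": "administration",
     "fulltext": "The graph on page 3 shows the budget ..."},
]

metadata_hits = fielded_search(docs, "graph", ["DC.Title", "DC.Subject"])
fulltext_hits = fielded_search(docs, "graph", ["fulltext"])
```

The metadata search returns only the document that is about graphs, whereas the full text search also returns the report that merely mentions a graph in passing.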

If both sufficiently many documents described according to the Dublin Core
scheme and search engines understanding Dublin Core existed, the Web
could be viewed as a “well-organized” global digital library. A key point here
is that “well-organized” is not defined universally by some enlightened general
committee; user groups (large or small) with certain common interests have
to get together and agree on their own standards, the usage of words,
term hierarchies, etc. to define what (within their local framework)
well-organized is intended to mean. This results in a decentralized system where
contradictions and conflicts may occur but that also has the potential to 
lead to globally accepted standards. We describe attempts of this type in 
the next section. 



3.4 Efforts of Scientific Societies 

In the early nineties the Bundesministerium für Forschung und Technologie
(BMFT, now BMBF) supported projects by the Deutsche Mathematiker-Vereinigung
(DMV) and the Deutsche Physikalische Gesellschaft (DPG) to intensify
the use of the mathematics and physics databases at Fachinformationszentrum
Karlsruhe within the academic community. The participants
in these projects soon realized that the use of some databases is important
but that the evolving Internet offers enormous potential for electronic
information, communication, publishing, etc. to support research and teaching.
Moreover, it was obvious that most of the organization and planning to be
done by the mathematics and physics societies is not subject specific. Since
there were many overlaps and joint interests, it was decided to join forces
and to start a cooperative effort, called the “Gemeinsame Initiative der
Fachgesellschaften zur elektronischen Information und Kommunikation”
(short: IuK Initiative). Starting with the leading scientific societies in
mathematics, physics, chemistry, and computer science, a treaty was signed,
committees were founded, etc. to push the use of the Internet forward, to
improve the computing and network facilities within the universities, to
develop information systems, and so on, in order to make the electronic
resources of the Internet more accessible to scientists and students “at their
workplace”. Even more important, all participating institutions and individuals
were encouraged to make their own electronic resources widely available and
“in an organized way”.

Within this cooperation there were both discipline-oriented projects, such as
MeDoc in computer science (supported by BMBF) or Math-Net in mathematics
(supported by Deutsche Telekom and DFN), and joint projects such
as Dissertation Online (supported by DFG). The IuK Initiative was
instrumental in setting up GLOBAL-INFO, a BMBF-funded support program
for global electronic and multimedia information systems. In all cases,
emphasis was laid on interoperability, joint interfaces, international standards,
etc. in order to be able to gain from the work of others.

At the same time the IuK Initiative realized that it had to act internationally.
Similar activities have started in other countries or, subject specific, on
an international level. It is important for the IuK Initiative to coordinate its
efforts with these activities to guarantee interoperability and provide mutual
access to the respective information systems.

The IuK Initiative was successful in spawning many activities and projects and
in fostering the development in other countries and internationally. Further
leading societies in Germany from biology, education, psychology, sociology, and
electrical engineering joined the initiative. The IuK Initiative cooperates with
librarians, publishers, and other information providers.

Everybody involved in the IuK Initiative came to agree that what is sometimes
called the “information chaos in the Internet” has to be overcome,
at least with respect to high quality scientific information. An important
prerequisite for this is that information offered in the Internet is
“well-structured”. This is, however, not sufficient if one intends to
collect information automatically, e.g., by means of web robots. Considering the
amount of information offered in the Internet, automatic resource discovery
is a must. This requires that resources are described by metadata. These
metadata have to be produced manually, preferably by the authors who offer
their resources. The metadata must be produced in a way that is
understandable by robots. This is how the Dublin Core came into play. The DC
initiative has, just as the IuK Initiative, broad transdisciplinary goals and is
supported by a wide spectrum of scientists, catalogers, librarians, etc. This
is why the IuK Initiative decided to play an active role in the general
development of the Dublin Core and to start implementing the concept, e.g.,
with the Math-Net project involving almost all mathematics departments
and research institutes in Germany. (And this is also why the authors of
this paper got interested in this topic.)

3.5 Uses of Metadata: Final Remarks 

As mentioned before, the development of metadata is by no means finished;
in particular, the Dublin Core is still undergoing technical and conceptual
revisions. It is too early to judge whether this concept will be a success for
the World Wide Web community as a whole.

Let us repeat: the Dublin Core is a conceptual framework for metadata
formats. Each group using Dublin Core must specify, for each Dublin Core
element, how to fill and interpret it. E.g., the element DC.CREATOR can be
specified in various ways. The metadata set for a preprint in the Math-Net
project uses DC.CREATOR for the authors of a paper. More precisely, it
uses the subfield DC.Creator.PersonalName, where the name of the author
has to be written in the form “last name, first name ...” (without title).
(The person entering this data does not have to know technical details, such
as coding a letter in HTML, since he only needs to fill out a simple form.)
Other user groups employ DC.CREATOR (or subfields thereof) to specify
the composer of a piece of music or a company owning a certain patent and
will probably require different forms of notation. Catalogers who have been
using bibliographic formats such as USMARC, MAB or the like are of course
more familiar with such concepts. They typically define a “crosswalk” from
their own data format to the elements of the Dublin Core by specifying, for
each element, the related set of fields of their own format.
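
A crosswalk of this kind is essentially a lookup table. The Python sketch below is a simplified illustration under our own assumptions: the table `CROSSWALK` covers only a handful of well-known USMARC tags, ignores subfields, and is not an official mapping; the function `to_dublin_core` and the sample record are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Simplified, illustrative mapping from USMARC field tags to DC elements.
CROSSWALK: Dict[str, str] = {
    "100": "DC.Creator",      # main entry, personal name
    "110": "DC.Creator",      # main entry, corporate name
    "245": "DC.Title",        # title statement
    "260": "DC.Publisher",    # publication information
    "520": "DC.Description",  # summary note
    "650": "DC.Subject",      # topical subject heading
}


def to_dublin_core(marc_fields: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Translate (tag, value) pairs into DC elements; unmapped tags are dropped."""
    dc: Dict[str, List[str]] = defaultdict(list)
    for tag, value in marc_fields:
        element = CROSSWALK.get(tag)
        if element is not None:
            dc[element].append(value)
    return dict(dc)


# Made-up record for illustration only.
marc_record = [
    ("100", "Doe, Jane"),
    ("245", "An example title"),
    ("650", "Metadata"),
    ("650", "Cataloging"),
    ("999", "local note"),   # no DC counterpart in this table
]

dc_record = to_dublin_core(marc_record)
```

Several source fields may map to the same element, and fields without a counterpart are lost, which is the price paid for a small common element set.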

This indicates the complexity of the process. There will always be different 
user groups having their own interpretations of the Dublin Core elements. 
However, all interpretations are using the same Dublin Core format, the 
fifteen Dublin Core elements. This way the Dublin Core provides a bridge 
connecting the resources of many user groups. This observation supports 
Priscilla Caplan’s view: 

“Now it appears an even more common application of DC is
as ‘lingua franca’, a least common denominator for indexing
across heterogeneous databases. ... the simplest way to index
them all with some degree of semantic consistency may be to
translate them all to DC.”

Dublin Core “is”, in its range of elements, a kind of “Inter-Meta-Data”. To
a certain degree it makes the integration of heterogeneous collections of
resources possible. This is important, and will be more so in the future, for
two reasons: (1) Inter- and transdisciplinary research projects are increasingly
common in modern science. (2) The research process of today results
in a variety of products, rather heterogeneous in form and content (e.g.,
articles, books, software, large data sets, videos, etc.). If the Dublin Core
becomes widely accepted, a market of search engines may also evolve on the
basis of future WWW protocol suites, employing the DC as a universal
data structure. Users of such engines may have access to an ever growing
number of heterogeneous and well-structured digital resources.

Our intention was to show where the Dublin Core and metadata described 
by Dublin Core elements can help organize knowledge that is reflected in 
data that are complex, heterogeneous, or interwoven. 

Let us conclude - in analogy to Bearman (1995) - with a list of items for which
metadata are absolutely necessary:

• Records with attributes of evidence: 

- Theses, dissertations, 

- Patents, authorship on new ideas, 

- Authentic art, originality of work.

• Unique artefacts or protected items: 

- Collections of museums (bones, stones, . . . ), 

- Historical books/documents (papyri, ancient bible, . . .), 




- Lecture notes, scientific books, audio cassettes, videos. 

• Business environments, litigations: 

- Document management/delivery,

- Online ordering. 

• Archives, collections of statistical data:

- Government, statistical authorities, 

- Local administrations. 

In all of these categories the original data or artefacts cannot be “handed
out” freely. In general, they must reside in an archive, a treasury, a
closed office, a depot, or a warehouse until the moment they are
exhibited, exchanged, or sold. In all of these cases a substitute must
exist which can be distributed freely or sent to a customer instead of the
original item. That is the purpose of all metadata.
Abstract: The problem of classifier combination is considered in the context of
the two main fusion scenarios: fusion of opinions based on identical and on distinct 
representations. We show that in both cases (distinct and shared representations), 
the expert fusion involves the computation of a linear or nonlinear function of 
the a posteriori class probabilities estimated by the individual experts. Classifier 
combination can therefore be viewed in a unified way as a multistage classification 
process whereby the a posteriori class probabilities generated by the individual 
classifiers are considered as features for a second stage classification scheme.



1 Introduction 

The goal of multiple expert fusion is to improve the accuracy of pattern 
classification. Instead of relying on a single decision making scheme, several 
designs (experts) are used for decision making. By combining the opinions 
of the individual experts, a consensus decision is derived. Various classi- 
fier combination schemes have been devised and it has been experimentally 
demonstrated that some of them consistently outperform a single best clas- 
sifier. 

An interesting issue in the research concerning classifier ensembles is the 
way they are combined. For a recent review of the literature see Kittler et al 
(1996). Here we shall focus on strategies which use soft decision outputs of 
the individual experts, expressed in terms of class a posteriori probabilities 
estimated by the experts. From the point of view of their analysis, there 
are basically two classifier combination scenarios. In the first scenario, all 
the classifiers use the same representation of the input pattern. In this case 
each classifier, for a given input pattern, can be considered to produce an 
estimate of the same a posteriori class probability. 

In the second scenario each classifier uses its own representation of the input 
pattern. In other words, the measurements extracted from the pattern are 
unique to each classifier. An important application of combining classifiers 
in this scenario is the possibility to integrate physically different types of 
measurements / features. In this case it is no longer possible to consider the 
computed a posteriori probabilities to be estimates of the same functional 
value, as the classification systems operate in different measurement spaces. 
In this paper we review the theoretical framework for classifier combination 
approaches for these two scenarios. The main finding of the paper is that 

^This work was supported by the Engineering and Physical Sciences Research Council 
Grant GR/J89255. 
in both cases (distinct and shared representations), the expert fusion in-
volves the computation of a linear or nonlinear function of the a posteriori 
class probabilities estimated by the individual experts. Classifier combina- 
tion can therefore be viewed as a multistage classification process whereby 
the a posteriori class probabilities generated by the individual classifiers are 
considered as features for a second stage classification scheme. Most impor- 
tantly, when the linear or nonlinear combination functions are obtained by 
training, the distinctions between the two scenarios fade away and one can 
view classifier fusion in a unified way. This probably explains the success 
of many heuristic combination strategies that have been suggested in the 
literature without any concerns about the underlying theory. 

The paper is organised as follows. In Section 2 we discuss combination 
strategies for experts using independent (distinct) representations. In Sec- 
tion 3 we consider the effect of classifier combination for the case of shared 
(identical) representation. The findings of the two sections are discussed in 
Section 4. Finally, Section 5 offers a brief summary. 



2 Distinct Representations 

It has been observed that classifier combination is particularly effective if the
individual classifiers employ different features (see Xu et al (1992), Ho et al
(1994)). Consider a pattern recognition problem where an object, event or
phenomenon, perceived by humans as pattern $Z$, is to be assigned to one of
the $m$ possible classes $\{\omega_1, \ldots, \omega_m\}$. Let us assume that we have $R$ classifiers,
each representing this object, event or phenomenon (referred to in general
as an entity) by a distinct measurement vector. Denote the measurement
vector used by the $i$-th classifier by $\mathbf{x}_i$. Thus $Z$ and $\mathbf{x}_i$, $i = 1, \ldots, R$, are
different representations of the same entity that we wish to recognise. In the
following, we shall not distinguish between the entity and its human nervous
system representation $Z$ and will refer to it simply as pattern $Z$.

In the measurement space of $\mathbf{x}_i$ each class $\omega_k$ is modelled by the probability
density function $p(\mathbf{x}_i \mid \omega_k)$ and its a priori probability of occurrence is denoted
$P(\omega_k)$. We shall consider the models to be mutually exclusive, which means
that only one model can be associated with each pattern.

Now according to the Bayesian theory, given measurements $\mathbf{x}_i$, $i = 1, \ldots, R$,
the pattern $Z$ should be assigned to class $\omega_j$, i.e. its label $\theta$ should assume
value $\theta = \omega_j$, provided the a posteriori probability of that interpretation is
maximum, i.e.

assign $\theta \to \omega_j$ if

$$P(\theta = \omega_j \mid \mathbf{x}_1, \ldots, \mathbf{x}_R) = \max_k P(\theta = \omega_k \mid \mathbf{x}_1, \ldots, \mathbf{x}_R) \tag{1}$$

Assuming the measurement vectors are conditionally statistically independent,
the decision rule in (1) can be shown to simplify to

assign $\theta \to \omega_j$ if

$$P^{-(R-1)}(\omega_j) \prod_{i=1}^{R} P(\theta = \omega_j \mid \mathbf{x}_i)\, p(\mathbf{x}_i) = \max_k P^{-(R-1)}(\omega_k) \prod_{i=1}^{R} P(\theta = \omega_k \mid \mathbf{x}_i)\, p(\mathbf{x}_i) \tag{2}$$

where $p(\mathbf{x}_i)$ is the unconditional probability density at measurement $\mathbf{x}_i$. The
decision rule (2) quantifies the likelihood of a hypothesis by combining the a
posteriori probabilities $P(\theta = \omega_k \mid \mathbf{x}_i)$ generated by the individual classifiers
by means of a product rule. It is effectively a severe rule for fusing the
classifier outputs, as it is sufficient for a single recognition engine to inhibit
a particular interpretation by outputting a close to zero probability for it.
It has been shown in Kittler et al (1996) that under certain assumptions
this severe rule can be developed into a benevolent information fusion rule
which has the form of a sum. Benevolent fusion rules are less affected by one
particular expert than severe rules. Thus even if the soft decision outputs
of a few experts for a particular hypothesis are close to zero, the hypothesis
may be accepted provided it receives sufficient support from all the other
experts.
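
The difference between the severe and the benevolent rule can be made concrete with a small Python sketch. It is an illustration under our own assumptions: the function names and the numbers are hypothetical, the class-independent density factors $p(\mathbf{x}_i)$ are dropped from the product (this does not change its decision), and in the sum form $(1 - R) P(\omega_k) + \sum_i P(\omega_k \mid \mathbf{x}_i)$ all density weights are set to one.

```python
from math import prod
from typing import List, Sequence


def argmax(scores: Sequence[float]) -> int:
    """Index of the largest score (the first one wins on ties)."""
    return max(range(len(scores)), key=lambda k: scores[k])


def product_rule(posteriors: List[List[float]], priors: List[float]) -> int:
    """Severe fusion: P(w_k)^-(R-1) * prod_i P(w_k | x_i); priors must be > 0."""
    R = len(posteriors)
    scores = [priors[k] ** (-(R - 1)) * prod(p[k] for p in posteriors)
              for k in range(len(priors))]
    return argmax(scores)


def sum_rule(posteriors: List[List[float]], priors: List[float]) -> int:
    """Benevolent fusion: (1 - R) P(w_k) + sum_i P(w_k | x_i), unit weights."""
    R = len(posteriors)
    scores = [(1 - R) * priors[k] + sum(p[k] for p in posteriors)
              for k in range(len(priors))]
    return argmax(scores)


# Made-up outputs of three experts for two classes: two experts favour
# class 0, the third one nearly vetoes it.
experts = [[0.90, 0.10], [0.80, 0.20], [0.01, 0.99]]
priors = [0.5, 0.5]

severe_decision = product_rule(experts, priors)
benevolent_decision = sum_rule(experts, priors)
```

In this example the single near-zero output is enough to make the product rule choose class 1, while the sum rule follows the two experts that favour class 0.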

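In particular, let us express the product of the a posteriori probabilities and
mixture densities on the right hand side of (2), $P(\theta = \omega_k \mid \mathbf{x}_i)\, p(\mathbf{x}_i)$, as

$$P(\theta = \omega_k \mid \mathbf{x}_i)\, p(\mathbf{x}_i) = P(\theta = \omega_k)\, p_i\, (1 + \delta_{ki}) \tag{3}$$

where $p_i$ is a nominal reference value of the mixture density $p(\mathbf{x}_i)$. A suitable
choice of $p_i$ is for instance $p_i = \max_{\mathbf{x}_i} p(\mathbf{x}_i)$. Substituting (3) for the a
posteriori probabilities in (2) and neglecting higher order terms we obtain a
sum decision rule

assign $\theta \to \omega_j$ if

$$(1 - R) P(\omega_j) + \sum_{i=1}^{R} P(\omega_j \mid \mathbf{x}_i) \frac{p(\mathbf{x}_i)}{p_i} = \max_k \left[ (1 - R) P(\omega_k) + \sum_{i=1}^{R} P(\omega_k \mid \mathbf{x}_i) \frac{p(\mathbf{x}_i)}{p_i} \right] \tag{4}$$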

This approximation will be valid provided $\delta_{ki}$ satisfies $|\delta_{ki}| \ll 1$. It can be
easily shown that this condition will be satisfied if $P(\omega_k \mid \mathbf{x}_i)\, p(\mathbf{x}_i) / (p_i P(\omega_k)) - 1$
is small in the absolute value sense.
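
The step from the product rule to the sum rule can be spelled out as follows. This is our own sketch of the intermediate algebra, using only the product form of rule (2) and the definition of $\delta_{ki}$ as the relative deviation of $P(\omega_k \mid \mathbf{x}_i)\, p(\mathbf{x}_i)$ from $P(\omega_k)\, p_i$; it is not taken verbatim from Kittler et al (1996). Inserting this definition into the class score of the product rule gives

$$P^{-(R-1)}(\omega_k) \prod_{i=1}^{R} P(\omega_k)\, p_i\, (1 + \delta_{ki}) = P(\omega_k) \left( \prod_{i=1}^{R} p_i \right) \prod_{i=1}^{R} (1 + \delta_{ki}).$$

The factor $\prod_i p_i$ is the same for all classes and can be dropped. Expanding the remaining product and neglecting terms of second and higher order in $\delta_{ki}$ yields

$$P(\omega_k) \prod_{i=1}^{R} (1 + \delta_{ki}) \approx P(\omega_k) + \sum_{i=1}^{R} P(\omega_k)\, \delta_{ki}.$$

Since $P(\omega_k)\, \delta_{ki} = P(\omega_k \mid \mathbf{x}_i)\, p(\mathbf{x}_i) / p_i - P(\omega_k)$, the score becomes

$$P(\omega_k) + \sum_{i=1}^{R} \left[ P(\omega_k \mid \mathbf{x}_i) \frac{p(\mathbf{x}_i)}{p_i} - P(\omega_k) \right] = (1 - R) P(\omega_k) + \sum_{i=1}^{R} P(\omega_k \mid \mathbf{x}_i) \frac{p(\mathbf{x}_i)}{p_i},$$

which is the quantity maximised over the classes by the sum rule.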

Before proceeding any further, it may be pertinent to ask why we did not
cancel out the unconditional probability density functions $p(\mathbf{x}_i)$ from the
decision rule (2). The main reason is that this term conveys very useful
information about the confidence of the classifier in the observation made. It
is clear that a pattern representation for which the value of the probability
density is very small for all the classes will be an outlier and should not be
classified by the respective classifier. By retaining this information, in the
case of the product rule (2) we have the option of suppressing the effect of
outliers on the decision making process by setting the a posteriori probabilities
for all the classes to a constant. In contrast, the sum information fusion
rule will automatically control the influence of such outliers on the final
decision. In other words, the classifier combination rule in (4) is a weighted
average rule where the weights $w(\mathbf{x}_i) = p(\mathbf{x}_i) / p_i$ reflect the confidence in the soft decision
values computed by the individual classifiers. Thus our decision rule (4) can
be expressed as

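assign $\theta \to \omega_j$ if

$$(1 - R) P(\omega_j) + \sum_{i=1}^{R} w(\mathbf{x}_i) P(\omega_j \mid \mathbf{x}_i) = \max_k \left[ (1 - R) P(\omega_k) + \sum_{i=1}^{R} w(\mathbf{x}_i) P(\omega_k \mid \mathbf{x}_i) \right] \tag{5}$$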

The main practical difficulty with the weighted average classifier combiner
as specified in (5) is that not all classifiers will have the inherent capability
to output such information. For instance, it would not be provided by
a multilayer perceptron and many other classification methods. We shall
therefore limit our objectives somewhat and identify the weights $w(\mathbf{x}_i) = w_i$
which will reflect the relative confidence in the classifiers in expectation.
These can easily be found by an exhaustive search through the weight space.
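
An exhaustive search of this kind can be sketched in Python as follows. The sketch makes several assumptions of its own that are not stated in the text: equal class priors (so that the prior term of the rule is a constant and can be dropped), weights restricted to a regular grid and normalised to sum to one, and a small labelled sample set with made-up numbers. The function names are hypothetical.

```python
from itertools import product
from typing import List, Sequence, Tuple


def weighted_decision(posteriors: Sequence[Sequence[float]],
                      weights: Sequence[float]) -> int:
    """Class with the largest weighted sum of expert outputs (equal priors)."""
    n_classes = len(posteriors[0])
    scores = [sum(w * p[k] for w, p in zip(weights, posteriors))
              for k in range(n_classes)]
    return max(range(n_classes), key=lambda k: scores[k])


def search_weights(samples: Sequence[Sequence[Sequence[float]]],
                   labels: Sequence[int],
                   steps: int = 4) -> Tuple[List[float], int]:
    """Try every weight vector on a grid with spacing 1/steps that sums to one
    and return the first one with the fewest misclassified samples."""
    n_experts = len(samples[0])
    best_weights: List[float] = []
    best_errors = len(samples) + 1
    for combo in product(range(steps + 1), repeat=n_experts):
        if sum(combo) != steps:
            continue  # keep only weight vectors that sum to one
        weights = [c / steps for c in combo]
        errors = sum(weighted_decision(s, weights) != y
                     for s, y in zip(samples, labels))
        if errors < best_errors:
            best_weights, best_errors = weights, errors
    return best_weights, best_errors


# Made-up data: each sample holds the outputs of two experts for two classes.
samples = [
    [[0.9, 0.1], [0.3, 0.7]],
    [[0.2, 0.8], [0.6, 0.4]],
    [[0.7, 0.3], [0.1, 0.9]],
    [[0.4, 0.6], [0.9, 0.1]],
]
labels = [0, 1, 0, 1]

best_weights, best_errors = search_weights(samples, labels)
```

In this toy data the first expert is always right and the second always wrong, so the search puts all the weight on the first expert. The number of grid points grows quickly with the number of experts, so such a search is practical only for small ensembles.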

In practice the individual experts will not output the true a posteriori
probabilities $P(\omega_k \mid \mathbf{x}_i)$, $i = 1, \ldots, R$, but instead their estimates $\hat{P}(\omega_k \mid \mathbf{x}_i)$, where

$$\hat{P}(\omega_k \mid \mathbf{x}_i) = P(\omega_k \mid \mathbf{x}_i) + e(\mathbf{x}_i) \tag{6}$$

and $e(\mathbf{x}_i)$ is the estimation error. It has been shown in Kittler et al (1996)
that the sensitivity to estimation errors of the product rule is much greater
than that of the sum rule.

It has been shown in Kittler (1996) that the decision rules (2) and (4) simplify
to the commonly used combination strategies such as the Product rule,
Sum rule, Min rule, Max rule, and Majority vote.
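
Three of the named strategies can be sketched in Python in their usual textbook form. The sketch assumes equal class priors and uses hypothetical function names and made-up numbers; it is meant as an illustration, not as the derivation given in the cited work.

```python
from collections import Counter
from typing import List, Sequence


def argmax(scores: Sequence[float]) -> int:
    """Index of the largest score (the first one wins on ties)."""
    return max(range(len(scores)), key=lambda k: scores[k])


def min_rule(posteriors: List[List[float]]) -> int:
    """Choose the class whose weakest expert support is largest."""
    n_classes = len(posteriors[0])
    return argmax([min(p[k] for p in posteriors) for k in range(n_classes)])


def max_rule(posteriors: List[List[float]]) -> int:
    """Choose the class whose strongest expert support is largest."""
    n_classes = len(posteriors[0])
    return argmax([max(p[k] for p in posteriors) for k in range(n_classes)])


def majority_vote(posteriors: List[List[float]]) -> int:
    """Each expert votes for its own most probable class."""
    votes = Counter(argmax(p) for p in posteriors)
    return votes.most_common(1)[0][0]


# Made-up outputs of three experts for two classes.
experts = [[0.90, 0.10], [0.80, 0.20], [0.01, 0.99]]

decisions = {
    "min": min_rule(experts),
    "max": max_rule(experts),
    "vote": majority_vote(experts),
}
```

On these numbers the min and max rules both follow the single very confident expert, while the majority vote follows the two experts that agree with each other.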



3 Identical Representations 

In many situations we wish to combine the results of multiple classifiers
which use an identical representation $\mathbf{x}_i = \mathbf{x}$, $\forall i$, for the input pattern $Z$.
A typical example of this situation is a battery of k-NN classifiers which
employ different numbers of nearest neighbours to reach a decision.
Alternatively, neural network classifiers trained with different initialisations or
different training sets (Wolpert (1992), Cao et al (1995)) also fall into this
category. The combination of ensembles of neural networks has been studied
in Hashem, Schmeiser (1995), Cho, Kim (1995) and Hansen, Salamon
(1990).

By means of classifier combination one is able to obtain a better estimate of
the a posteriori class probabilities and in consequence a reduced classification
error. A typical estimator is the averaging estimator

$$\hat{P}(\omega_i \mid \mathbf{x}) = \frac{1}{N} \sum_{j=1}^{N} P_j(\omega_i \mid \mathbf{x}) \tag{7}$$

where $P_j(\omega_i \mid \mathbf{x})$ is the a posteriori class probability estimate given pattern
$\mathbf{x}$, delivered by the $j$-th estimator, and $\hat{P}(\omega_i \mid \mathbf{x})$ is the combined estimate
based on $N$ observations.

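Assuming that the errors $e_j(\omega_i \mid \mathbf{x})$ between the true class a posteriori
probabilities $P(\omega_i \mid \mathbf{x})$ and their estimates are unbiased, the combined estimate
$\hat{P}(\omega_i \mid \mathbf{x})$ will be an unbiased estimate of $P(\omega_i \mid \mathbf{x})$. Suppose the standard
deviations $\sigma_j(\omega_i \mid \mathbf{x})$ of the errors $e_j(\omega_i \mid \mathbf{x})$ are equal, i.e. $\sigma_j(\omega_i \mid \mathbf{x}) = \sigma(\mathbf{x})$ $\forall i, j$.
Then, provided the errors $e_j(\omega_i \mid \mathbf{x})$ are independent, the variance of the error
distribution for the combined estimate, $\hat{\sigma}^2(\mathbf{x})$, will be $\hat{\sigma}^2(\mathbf{x}) = \sigma^2(\mathbf{x}) / N$.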

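Now if the standard deviations $\sigma_j(\omega_i \mid \mathbf{x})$ of the errors are not identical, then
the combined estimate should take that into account by giving more weight to the
contributions of the estimates associated with a lower variance, i.e.

$$\hat{P}(\omega_i \mid \mathbf{x}) = \frac{1}{\sum_{j=1}^{N} \frac{1}{\sigma_j^2(\omega_i \mid \mathbf{x})}} \sum_{j=1}^{N} \frac{1}{\sigma_j^2(\omega_i \mid \mathbf{x})} P_j(\omega_i \mid \mathbf{x}) \tag{8}$$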



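Provided the errors are unbiased and independent, the combined estimate
in (8) will also be unbiased and its variance $\hat{\sigma}^2(\omega_i \mid \mathbf{x})$ will be

$$\hat{\sigma}^2(\omega_i \mid \mathbf{x}) = \frac{1}{\sum_{j=1}^{N} \frac{1}{\sigma_j^2(\omega_i \mid \mathbf{x})}} \tag{9}$$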

From (9) it can be seen that the variance of the error distribution of the 
combined estimator will be dominated by the low variance terms. 
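
The inverse-variance weighting and the variance of the combined estimate can be checked numerically with a short Python sketch. The function name and all numbers are our own illustrative assumptions; the error variances are taken as known, which in practice they rarely are.

```python
from typing import Sequence, Tuple


def inverse_variance_combine(estimates: Sequence[float],
                             variances: Sequence[float]) -> Tuple[float, float]:
    """Combine estimates of one class probability with weights 1 / variance.

    Returns the combined estimate and the variance of its error, assuming
    unbiased and independent errors with the given (positive) variances.
    """
    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)
    combined = sum(w * p for w, p in zip(precisions, estimates)) / total_precision
    return combined, 1.0 / total_precision


# Made-up estimates of P(w_i | x) from three estimators; the third one is
# four times as noisy as the other two.
estimates = [0.70, 0.60, 0.90]
variances = [0.01, 0.01, 0.04]

combined, combined_variance = inverse_variance_combine(estimates, variances)

# With equal variances the rule reduces to plain averaging with variance
# sigma^2 / N.
_, equal_case_variance = inverse_variance_combine([0.5, 0.6, 0.7, 0.8], [0.02] * 4)
```

The combined variance is smaller than the smallest individual variance, and the noisy third estimator contributes comparatively little to the result.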

The weighted estimator (8) represents a general case which may be written
as

$$\hat{P}(\omega_i \mid \mathbf{x}) = \sum_{j=1}^{N} w_{ij}(\mathbf{x}) P_j(\omega_i \mid \mathbf{x}) \tag{10}$$

with the weights $w_{ij}(\mathbf{x})$ satisfying $\sum_{j=1}^{N} w_{ij}(\mathbf{x}) = 1$. It will assume a specific
form in particular circumstances. For instance, if the properties of
the individual estimators are class independent, the weights will satisfy
$w_{ij}(\mathbf{x}) = w_j(\mathbf{x})$. If, in addition, the variances of the error distributions
of the individual estimators $\sigma_j^2(\omega_i \mid \mathbf{x})$ are independent of the position in the
pattern space, the weights will satisfy $w_{ij}(\mathbf{x}) = w_j$. It also subsumes the case
when the variances are all identical, with $w_{ij}(\mathbf{x}) = 1/N$.

Recall that when the respective variances of the individual estimators are
known the weights can be determined using the formula

$$w_{ij}(x) = \frac{1/\sigma_j^2(\omega_i|x)}{\sum_{k=1}^{N} 1/\sigma_k^2(\omega_i|x)} \qquad (11)$$

If this information is not available, it may be possible to estimate the
appropriate weights so that the classification error obtained with the estimator in
(10) is minimised. In order to adopt this approach it will be necessary to
have another independent set of training data.
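One simple way to estimate such weights from an independent data set is an exhaustive search over candidate weights. The sketch below does this for two estimators with position-independent weights $w$ and $1 - w$; the grid search, the function names and the small validation set are assumptions made for illustration and are not the training procedure of any of the cited papers.

```python
from typing import List, Sequence, Tuple

Posterior = Sequence[float]


def combine(weights: Sequence[float], posteriors: Sequence[Posterior]) -> List[float]:
    """Linear combination (10) of the experts' posterior vectors."""
    num_classes = len(posteriors[0])
    return [sum(w * p[i] for w, p in zip(weights, posteriors))
            for i in range(num_classes)]


def error_rate(weights: Sequence[float],
               samples: Sequence[Sequence[Posterior]],
               labels: Sequence[int]) -> float:
    """Fraction of validation patterns misclassified by the combined estimate."""
    errors = 0
    for posteriors, label in zip(samples, labels):
        combined = combine(weights, posteriors)
        decision = max(range(len(combined)), key=lambda i: combined[i])
        errors += decision != label
    return errors / len(labels)


def fit_two_expert_weight(samples: Sequence[Sequence[Posterior]],
                          labels: Sequence[int],
                          steps: int = 10) -> Tuple[float, float]:
    """Grid search for the weight w of expert 1 (expert 2 receives 1 - w)."""
    best_w, best_err = 0.0, error_rate([0.0, 1.0], samples, labels)
    for k in range(1, steps + 1):
        w = k / steps
        err = error_rate([w, 1.0 - w], samples, labels)
        if err < best_err:
            best_w, best_err = w, err
    return best_w, best_err


# Hypothetical validation set: (expert 1 output, expert 2 output) per pattern.
# Expert 2 is deliberately unreliable on these patterns.
samples = [
    ([0.9, 0.1], [0.4, 0.6]),
    ([0.2, 0.8], [0.7, 0.3]),
    ([0.8, 0.2], [0.3, 0.7]),
    ([0.3, 0.7], [0.6, 0.4]),
]
labels = [0, 1, 0, 1]

best_w, best_err = fit_two_expert_weight(samples, labels)
```

Because the error rate is piecewise constant in the weights, a coarse grid is adequate for two or three estimators; for a larger number of experts a proper optimisation method would be needed.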

Note that the estimator (10) is defined as a linear combination of the
individual estimates. This immediately suggests that it may be possible to obtain
an even better combined estimate of the class a posteriori probabilities by
means of a nonlinear combination function as

$$\hat{P}(\omega_i|x) = F(P_1(\omega_i|x), \ldots, P_N(\omega_i|x)) \qquad (12)$$

In fact, estimators which aim to enhance their resilience to outliers by
adopting a rank order statistic such as the median, i.e.
$\hat{P}(\omega_i|x) = \mathrm{med}_{j=1}^{N} P_j(\omega_i|x)$, fall into this category. Such
nonlinear estimators do not require any additional training. However, if
sufficient additional training data is available, a suitable nonlinear function
may be found by means of general function approximation (e.g. neural network
methodology) or by other design alternatives. The effective local variance of
the resulting estimator could be estimated from the input variances by
function linearising techniques.
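The resilience of the median to a single outlying estimate is easy to demonstrate. The following sketch applies the median class by class; the function name and the numbers are illustrative assumptions.

```python
from statistics import median
from typing import List, Sequence


def median_posteriors(estimates: Sequence[Sequence[float]]) -> List[float]:
    """Median combiner: class-wise median of the N posterior estimates."""
    num_classes = len(estimates[0])
    return [median(e[i] for e in estimates) for i in range(num_classes)]


# Hypothetical outputs of three estimators for two classes.
# The third estimator is an outlier that strongly disagrees with the others.
estimates = [
    [0.80, 0.20],
    [0.75, 0.25],
    [0.05, 0.95],
]

medians = median_posteriors(estimates)
mean_class0 = sum(e[0] for e in estimates) / len(estimates)
```

Here the median for the first class stays at 0.75, whereas the average is pulled down to roughly 0.53 by the outlier. In general the class-wise medians need not sum to one, so they may have to be renormalised if they are to be interpreted as probabilities.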

For discriminant function classifiers the benefit of combining multiple experts 
using an identical representation has been investigated by Tumer, Ghosh
(1996). They showed that the classification error will be reduced as a
result of the effective discriminant function of the combiner being closer to 
the Bayesian decision boundary. 

A linear combiner of classifier outputs has been applied to the problem of 
combining evidence in an automatic personal identity verification system in 
Kittler et al (1997c). The system fuses multiple instances of biometric data 
to improve performance. In this application a single classifier computes a 
posteriori class probabilities for several instances of input data over a short 
period of time which are then combined. For this reason an equal weight 
combination was appropriate. A combination strategy involving unequal 
weights has been used in Kittler et al (1997b) to fuse the a posteriori class 
probabilities of several classifiers employed in the detection of microcalcifi- 
cations in mammographic images. The weights were estimated by training. 
The combination of classifiers which produce statistically dependent outputs 
is discussed in Bishop (1995). The approach also leads to a linear combi- 
nation where the weights reflect the correlations between individual expert 
outputs. 



4 Discussion 

In practical situations one is also likely to face a problem where a part of the 
representation used by the respective experts is shared and a part is distinct. 
This problem has been considered in Kittler et al (1997a). 

All the combination strategies discussed can be viewed as a multistage pro- 
cess whereby the input data is used to compute the relevant a posteriori 
class probabilities which in turn are used as features in the next processing
stage. The problem is then to find class separating surfaces in this new
feature space. The sum rule and the averaging estimator and their weighted
versions then implement linear separating boundaries in this space. The 
other combination strategies implement nonlinear boundaries. The idea can 
then be extended further and the problem of combination posed as one of 
training the second stage using these probabilities so as to minimise the 
recognition error. This is the approach adopted by various multistage com- 
bination strategies as exemplified by the behaviour knowledge space method 
of Huang, Suen (1995) and the technique in Wolpert (1992). 
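The multistage view can be made concrete with a very small second-stage classifier that takes the concatenated a posteriori probabilities as its feature vector. The sketch below uses a nearest class mean rule purely as a stand-in; it is neither the behaviour knowledge space method nor stacked generalization, and all names and data are assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def stack_features(posteriors: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate the experts' posterior vectors into one feature vector."""
    return [p for expert in posteriors for p in expert]


def fit_class_means(features: Sequence[Sequence[float]],
                    labels: Sequence[int]) -> Dict[int, List[float]]:
    """Second-stage training: mean feature vector of each class."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for feature, label in zip(features, labels):
        acc = sums.setdefault(label, [0.0] * len(feature))
        for k, value in enumerate(feature):
            acc[k] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(means: Dict[int, List[float]], feature: Sequence[float]) -> int:
    """Assign the class whose mean is closest in the probability feature space."""
    def squared_distance(label: int) -> float:
        return sum((m - f) ** 2 for m, f in zip(means[label], feature))
    return min(means, key=squared_distance)


# Hypothetical second-stage training set. Expert 2 is systematically misleading,
# which a trained second stage can learn to exploit.
train_posteriors = [
    ([0.9, 0.1], [0.4, 0.6]),
    ([0.8, 0.2], [0.3, 0.7]),
    ([0.2, 0.8], [0.7, 0.3]),
    ([0.3, 0.7], [0.6, 0.4]),
]
train_labels = [0, 0, 1, 1]
means = fit_class_means([stack_features(p) for p in train_posteriors], train_labels)

new_pattern = ([0.55, 0.45], [0.2, 0.8])
stacked_decision = predict(means, stack_features(new_pattern))

# For comparison: the untrained averaging estimator on the same pattern.
average = [(a + b) / 2 for a, b in zip(*new_pattern)]
average_decision = max(range(len(average)), key=lambda i: average[i])
```

On this toy pattern the trained second stage and the plain average disagree, because the second stage has learnt how the second expert behaves on each class. This is only meant to show why training in the probability feature space can help when the experts' outputs are biased or dependent.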

When linear or nonlinear combination functions are acquired by means of 
training, there is very little distinction between the two basic scenarios.
Moreover, such solutions are able to handle the fusion of measurements which 
are not conditionally statistically independent. Consequently it is possible 
to view classifier combination in a unified way. This probably explains the 
successes achieved with heuristic combination schemes derived without any 
serious concerns about their theoretical legitimacy. 



5 Conclusions 

The problem of classifier combination was considered in the context of the 
two main fusion scenarios: fusion of opinions based on identical and on dis- 
tinct representations. The main finding of the paper is that in both cases 
(distinct and shared representations), the expert fusion involves the compu- 
tation of a linear or nonlinear function of the a posteriori class probabilities 
estimated by the individual experts. Classifier combination could therefore 
be viewed in a unified way as a multistage classification process whereby 
the a posteriori class probabilities generated by the individual classifiers are 
considered as features for a second stage classification scheme. 

References 

BISHOP, C. M. (1995): Neural networks for pattern recognition. Clarendon Press, 
Oxford. 

CAO, J., AHMADI, M. and SHRIDHAR, M. (1995): Recognition of handwritten 
numerals with multiple feature and multistage classifier. Pattern Recognition, 28, 
153-160. 

CHO, S.B. and KIM, J.H. (1995): Multiple network fusion using fuzzy logic. IEEE
Transactions on Neural Networks, 6, 497-501.

HANSEN, L.K. and SALAMON, P. (1990): Neural network ensembles. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 12, 993-1001.

HASHEM, S. and SCHMEISER, B. (1995): Improving model accuracy using
optimal linear combinations of trained neural networks. IEEE Transactions on
Neural Networks, 6, 792-794.

HO, T.K., HULL, J.J. and SRIHARI, S.N. (1994): Decision combination in
multiple classifier systems. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 16, 66-75.

HUANG, T. S. and SUEN, C. Y. (1995): Combination of multiple experts for the
recognition of unconstrained handwritten numerals. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 17, 90-94.

KITTLER, J., HATEF, M. and DUIN, R.P.W. (1996): Combining classifiers.
Proc. 13th Intern. Conf. on Pattern Recognition, Vienna, 1996, Vol. II, 897-901.

KITTLER, J., HOJJATOLESLAMI, A. and WINDEATT, T. (1997a): Strategies
for combining classifiers employing shared and distinct pattern representations.
Pattern Recognition Letters, 18, 1373-1377.

KITTLER, J., HOJJATOLESLAMI, A. and WINDEATT, T. (1997b): Weighting
factors in multiple expert fusion. Proc. BMVC, Colchester, 41-50.

KITTLER, J., MATAS, J., JONSSON, K. and RAMOS SANCHEZ, M.U. (1997c):
Combining evidence in personal identity verification systems. Pattern Recognition
Letters, 18, 845-852.

TUMER, K. and GHOSH, J. (1996): Analysis of Decision Boundaries in Linearly
Combined Neural Classifiers. Pattern Recognition, 29, 341-348.

WOLPERT, D.H. (1992): Stacked generalization. Neural Networks, 2, 241-260.

XU, L., KRZYZAK, A. and SUEN, C.Y. (1992): Methods of combining multiple
classifiers and their applications to handwriting recognition. IEEE Trans. SMC,
22, 418-435.




How To Make a Multimedia Textbook and
How to Use It¹

Thomas Ottmann, Matthias Will

Institut für Informatik, Universität Freiburg
e-mail: {ottmann, will}@informatik.uni-freiburg.de

Abstract: Taking the example of an innovative textbook on algorithm design, 
we illustrate the problems to be faced in scientific electronic publishing within the 
context of open educational environments. The issues to be considered particu- 
larly relate to design and choice of media types and document types. In reviewing 
the textbook’s production process, we show that the problems to be faced are 
mostly due to a lack of tools to support the authors. 



1 Introduction 

In the last few years, especially within the context of the popularization of 
the World Wide Web, numerous efforts have been undertaken to distribute 
electronically available publications via networks. For many technical ar- 
eas, where documents are produced using text processing software, it seems 
reasonable not only to offer scientific publications in paperbound form, but 
also to make them available as online texts for quick dissemination and use 
among the scientific community. A further step is to systematically offer on- 
line publications via digital libraries, including journals, reports, textbooks 
and others, and to make them searchable via a standard network browser 
on the scientist’s desktop. While it is important to provide an appropriate 
framework for distributed access to heterogeneous documents and databases, 
it is an even greater challenge to produce high quality innovative multime- 
dia textbooks for open environments, which can easily be used in varying 
contexts and for different purposes. 

In university environments in particular, we find a long tradition of paper- 
bound documents being produced by academics who are experts regarding 
content, but not with respect to typography, layout, and systems technology. 
We observe that the same people have nowadays also started to produce their
own multimedia documents, with far less satisfying results. In this paper,
we reveal some of the reasons why it is inherently more difficult to produce a
multimedia textbook than it is to produce its conventionally published
counterpart. We will exemplify our considerations through experiences gained
during the production process of a multimedia textbook on algorithm design
(Ottmann (Ed.) (1998)).

In areas with highly structured, deductive, and theoretical content, such as
mathematics, theoretical physics and large parts of computer science, both
lecturers and students have had the following experience: the step-by-step
development of a chain of intrinsically involved arguments can be done much
better by an oral presentation using blackboard and chalk or well-prepared
transparencies than in the form of a written text in a scientific paper or
textbook. This observation particularly applies to the area of algorithm
design and analysis. Hence, it was our aim to produce a new kind of
true multimedia textbook which combines the advantages of both media:
the printed version allows easy access and convenient, eye-friendly reading,
whereas the accompanying CD-ROMs preserve as much as possible of the
well-prepared oral presentations of the book topics and link them with texts,
animations and simulations for offline use and self-paced study.

¹This work was supported by the German Ministry for Education, Science, Research
and Technology as part of the MeDoc project (grant no. 08 C58 25).

A true multimedia publication integrates discrete and static media, which
can also be printed, with continuous media, which can only be presented using
a computer. Our aim was to produce such a publication with minimal effort. 
However, this goal is much easier stated than achieved. We first discuss 
the inherent difficulties in the production process of multimedia documents 
in general before illustrating them through our specific experiences gained 
during the last year. 



2 Design Considerations 

It is well-known that the visualization of dynamic processes and the
computer-based modeling of scientific experiments are tedious, labour-intensive,
and time-consuming tasks. Therefore, it is desirable to be able to re-use
pre-existing modules, either self-developed or third-party, and to integrate them
into a piece of courseware. However, this is only possible in the context of 
open environments which minimize the constraints on combining heteroge- 
neous media formats. By using a modular approach, individual components 
may also be re-used in other contexts. 

A printed publication consists of chapters, sections and paragraphs. Its log- 
ical structure is hierarchical and can be visualized as a tree. As opposed to 
this, the relationships between the individual parts of a multimedia product 
may be arbitrarily complex, thus yielding a directed graph. The logical, 
tree-like structure of a “traditional” publication is linearized by generating 
the hardcopy, where the table of contents illustrates the logical structure to 
the reader to support his orientation. In electronic publishing, the relations 
between individual modules are modeled with hyperlinks, whose main role 
is, in analogy, to ensure the user’s orientation. Structural references result 
from the publication’s inherent structure, while logical references emerge 
from the logical interdependencies of the individual modules. While hyper- 
links may be helpful in associating specific parts of a publication with each 
other, they may lead to disorientation. Graph visualization techniques, rang- 
ing from context-sensitive menus to fish-eye views (Greenberg et al. (1996)) 
and three-dimensional representations (Robertson et al. (1993)) help the 
user in constructing a mental model of the structural and logical
interdependencies, but as these become more complex, the user's cognitive abilities
are exceeded. Therefore, we focused mainly on modeling structural links 
(see also section 4.4). 



3 Media Types and Document Formats 

The decision to use specific media types in an electronic publication
constrains the set of usable document formats, but does not determine it. For
example, video clips may exist as QuickTime or MPEG movie files, and
simulations may be implemented in C or Java. A general recommendation
is to use platform-independent software, or document types for which
appropriate viewers exist for multiple operating systems. In this section, we 
first propose a general classification of document formats. As textual doc- 
uments are essential for knowledge transfer (see section 3.1), the problems 
in transferring documents to a suitable format for electronic publishing are 
exemplified for the case of static text. In the last paragraph, we discuss the 
issue of recording telepresentations, since they form an integral part of the 
multimedia textbook whose production process is described in more detail 
in section 4. 



3.1 Primary and Secondary Document Formats 

We call media types directly generated as output from a specific software tool 
or manipulated with an editor primary formats. All other media types can 
be assumed to be generated by processing primary formats, or through the 
use of export filters in standard tools, e. g. PostScript or image filters. These 
are called secondary formats and are characterized by not being directly 
generated. In many cases, primary document formats are proprietary, such 
as Word, Excel, or PageMaker, and therefore cannot directly be created with 
standard editors, but require special tools, e. g. word processors or drawing 
tools. As opposed to this, secondary media formats result from a conversion 
process from a source format to a destination format, as in the case of a
LaTeX document being transferred to PostScript for printing, or a Word
document being exported as an HTML file. The source format may either
be primary (Word, XFig), or, in the case of repeated post-processing,
secondary, such as a PostScript document originating from a DVI file which
is being converted to PDF. 

Since automatic conversion of large documents without manual postprocess- 
ing is not feasible, it is crucial to supply the authors with specific guidelines 
on document styles and additional mark-up which may be transferred to 
the publication format to automatically generate navigational support, such 
as typed hyperlinks. This would save the producer from having to manually
adjust changed references in the case of source document editing. Most
importantly, the individual modules should be delivered to the producer as
final documents to prevent repeated manual post-processing.

Note that this distinction between primary and secondary document formats
is related to the process through which a file was generated, e. g. an HTML
file may be in primary format if it was directly produced through the use of
an editor, but it may also have been created via an export filter, and thus
be a secondary format.
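Since the distinction depends on how a file came into being rather than on its file type, it can be modelled by recording the provenance of each document. The following is a minimal sketch of that idea; the class `Document` and its fields are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Document:
    """A document together with the document it was converted from, if any."""
    name: str
    fmt: str
    source: Optional["Document"] = None  # None: produced directly by a tool

    @property
    def kind(self) -> str:
        """'primary' if directly authored, 'secondary' if produced by conversion."""
        return "primary" if self.source is None else "secondary"

    def convert(self, target_fmt: str) -> "Document":
        """Model one conversion step (e.g. an export filter)."""
        return Document(self.name, target_fmt, source=self)

    def chain(self) -> List[str]:
        """Formats passed through, from the authoring format to this one."""
        formats = []
        doc = self
        while doc is not None:
            formats.append(doc.fmt)
            doc = doc.source
        return list(reversed(formats))


# A LaTeX article converted via DVI and PostScript to PDF.
pdf = Document("chapter", "LaTeX").convert("DVI").convert("PostScript").convert("PDF")

# The same file type can be primary or secondary, depending on its origin.
authored_html = Document("overview", "HTML")
exported_html = Document("report", "Word").convert("HTML")
```

Keeping the conversion chain explicit also makes it visible how many lossy steps lie between the authoring format and the published format.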

The advantage of primary formats lies in the fact that by editing and saving 
a document, its current version is always directly available, whereas one or 
more conversion steps are necessary in order to generate other formats from 
the current version, which may involve a loss of quality. Consider for exam- 
ple the process of converting an HTML document with logical markup into 
a PostScript document. As a result, the logical structure of the document 
is lost. Furthermore, manual post-processing may be necessary in order to 
generate an acceptable document version in a secondary format, such as in 
the case of a long PDF document which is enhanced by adding bookmarks 
in order to ensure the reader’s orientation. Thus, at first glance, primary 
formats are preferred when compiling multimedia courseware. However, ex- 
periences with library projects such as MeDoc (Will (1997)) or Liberation 
(Stubenrauch et al. (1999)) show that the available primary formats used by 
authors are mostly not suitable for digital publishing. Therefore, a tedious 
and time-consuming conversion of documents is mandatory, followed by a 
post-processing to account for a potential loss of quality, or to optimize the 
converted document. 

3.2 Preparing Text Documents for Digital Publishing 

As we have seen, the use of textual material is crucial for any knowledge 
transfer. In (Will (1997)), a number of criteria for hypertext formats were 
developed, most importantly availability for multiple platforms, typogra- 
phy, cross-referencing, document structure, and navigation. Additionally, 
the characteristics of HTML and PDF, currently being the only suitable 
alternatives for text formats in digital publishing, were sketched. 

HTML is widely known as the native language of the World Wide Web. It 
is a primary format if directly produced (e. g. through the use of HTML 
editors) in order to generate information pages. However, most authors 
use word processing software that does not directly produce HTML output. 
Thus, in the context of scientific publishing, it is used as a secondary format. 
In particular, even if an export filter producing HTML code is available, the 
result does not suffice for direct incorporation into multimedia publications, 
e. g. if complex typography (as in mathematical formulae) is needed. 

The Portable Document Format (Bienz, Cohn (1993)) is usually a secondary 
format and thus not directly generated. Also, software offering an export fil- 
ter for PDF, such as some word processors, does not offer the editing features 
required to produce a hyperlinked document for digital publishing. There- 
fore, even though editing tools for PDF documents are available, manual 
post-processing is required in most cases to achieve a satisfying
presentation, e. g. by adding navigational aids. While it is theoretically possible
to post-process PDF documents using a standard editor, difficulties arise in
the case of updating texts, for example the adjustment of hyperlinks. The
reason is that links in PDF documents are stored as geometric coordinates
instead of being associated with the underlying text or graphic element².

In short, we note that the primary (authoring) format of a publication is 
generally different from its final (secondary) format and that crucial re- 
quirements, such as hyperlinking, are not met by the authors’ tools. Hence, 
serious problems arise both when converting documents which were not orig- 
inally intended for electronic publishing, and when the authoring format dif- 
fers from the publishing format. The latter applies in almost all cases except 
for that of authors who directly write their text in HTML format. 

3.3 Recording of Telepresentations 

As mentioned in section 2, one aim in producing the multimedia textbook 
was to enhance the printed documents by recorded lectures, delivered by 
experts on the individual chapters’ topics. However, conventional record- 
ing through the use of analog media (video or film) does not match well 
with digital media. Additionally, the enormous bandwidth required for dis- 
seminating digital video makes it a questionable option for use within the 
context of Internet services³. Even though streaming technologies (RealAudio,
RealVideo) may in theory seem to open new avenues, the quality of
the data streams as received by the client may be very low, and the network 
load is tremendous. 

Alternatively, commands executed at the protocol level may be recorded, 
as done by the Mbone VCR (Holfelder (1997)), which records multicast 
video conferences, or by the Lotus ScreenCam software, which monitors and 
stores sequences of operating system commands. While this approach saves 
storage space, the recording level is generally too low to allow for extensions 
of pure playback, e. g. random access of multiple data streams. Another 
disadvantage is its close relation to a specific operating system protocol. 

²The PDFLaTeX project offers a solution to directly transfer documents in LaTeX into
fully hyperlinked PDF format. This requires the inclusion of all hypertext-specific
information as macros in the source files.

³By splitting large video documents, offline retrieval may be feasible, but browsing
through large document repositories results in heavy network load.

A solution is the recording of the data streams generated during a lecture on
a carefully chosen symbolic level, as in the Authoring on the Fly approach
(Bacher, Ottmann (1996)). More specifically, the actions generated by the
lecturer annotating his slides on the whiteboard are stored in an object file,
which contains not only the parameters of the graphical objects, such as
polygons, circles and rectangles being drawn on the whiteboard, but also
the file names of the slides and the external applications being used. At the
same time, an event queue is written, which associates time stamps with
indices of objects stored in the object file. The time stamps can be associated
with specific fragments of the audio file, containing the lecturer's speech,
such that a replay of the lecture allowing random access and visible scrolling
becomes possible. Our experience shows that the lecturer's video (talking
head) does not add much to the comprehension of the lecture. Furthermore,
it requires considerable bandwidth. Therefore, it was not included in the
final document, and a picture of the lecturer is displayed during replay
instead. The resulting multimedia document can be integrated into various
contexts, namely CD-ROM products for offline use, and local networks. In
the latter case a Web interface is provided, where an ordered list of subtopics
is automatically generated with one entry per slide change, including access
to externally launched applications. Thus, random access to Authoring on
the Fly (AOF) documents can be controlled both externally, e. g. from
HTML documents, and internally, via the replaying software.
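The interplay of object file and event queue, and the random access it enables, can be sketched as follows. This is not the AOF file format, whose details are not given here: the tuple layout, the file names and the rule that a slide change clears earlier annotations are all assumptions made for illustration.

```python
import bisect
from typing import List, Optional, Sequence, Tuple

# Object file: whiteboard objects addressed by their index.
WhiteboardObject = Tuple[str, str]   # (kind, payload), hypothetical layout
# Event queue: (time stamp in ms, index into the object file), sorted by time.
Event = Tuple[int, int]


def state_at(events: Sequence[Event],
             objects: Sequence[WhiteboardObject],
             time_ms: int) -> Tuple[Optional[str], List[WhiteboardObject]]:
    """Whiteboard state at time_ms: current slide and the annotations on it."""
    times = [t for t, _ in events]
    upto = bisect.bisect_right(times, time_ms)   # events with time stamp <= time_ms
    slide: Optional[str] = None
    annotations: List[WhiteboardObject] = []
    for _, index in events[:upto]:
        kind, payload = objects[index]
        if kind == "slide":
            # Assumption: loading a new slide clears earlier annotations.
            slide, annotations = payload, []
        else:
            annotations.append((kind, payload))
    return slide, annotations


# Hypothetical recording: two slides with a few annotations.
objects = [
    ("slide", "slide01.gif"),
    ("line", "10,10 50,50"),
    ("circle", "30,30 r=5"),
    ("slide", "slide02.gif"),
    ("rectangle", "0,0 20,10"),
]
events = [(0, 0), (4000, 1), (9000, 2), (60000, 3), (65000, 4)]

slide, annotations = state_at(events, objects, 10000)
```

Because the same time stamps index into the audio file, jumping to an arbitrary position only requires seeking in the audio and rebuilding the whiteboard state from the event queue, rather than decoding a video stream.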



4 Producing a Multimedia Textbook 

This section describes the production process of the multimedia textbook 
on algorithm design (Ottmann (Ed.) (1998)), as illustrated in figure 1. It
consists of eleven chapters and integrates discrete and continuous media.

Figure 1: Overview of the production process of the multimedia textbook



4.1 Authors as Content Providers 

Each chapter consists of a scientific paper, with the same topic also being
explained orally in a lecture. The primary format for the articles was LaTeX,
which is commonly used by computer scientists. The lectures were delivered
with the Mbone tool wb (Eriksson (1994)), which is the computer-based
equivalent of a blackboard or an overhead projector. Before each talk,
the slides used during the lecture were edited using a variety of tools, such as
SliTeX, PowerPoint or Showcase. These were then exported into PostScript
format. Additionally, the external applications used in some of the lectures
were installed on the local computer, which required the writing of
appropriate shell scripts to launch them from the computer whiteboard. In our case,
external applications were either XTango animations, which had been
compiled into a program executable on the lecturer's platform, or MPEG movies
generated from individual screenshots or the Maple mathematics package⁴.
The articles and animations were available both in primary and in secondary
format, the former because the authors supplied us with their source files,
including required style files and extensions, the latter because the animations
had been developed locally using the XTango package (Stasko (1992)).

4.2 Multimedia Reproduction of Online Lectures 

The lecturer first loads the slides and installs the applications. He then 
annotates and orally comments on the slides during the live lecture. All 
data streams (i. e. whiteboard actions, audio, video) were recorded using
our own software and simultaneously transmitted to remote locations, as 
the Mbone tools offer teleteaching facilities. In order for us to be able to 
do this, the slides and external applications also needed to be installed at 
the remote locations. Thus, only the transmission of slide changes and start 
and stop commands for the external applications was required. 

The recording yielded an audio file as well as an event queue and an object 
file, as described in section 3.3. The slides were automatically converted to 
GIF format and stored in the same directory as the other data, in particular 
the scripts to launch the applications and the applications themselves. In 
this way, instant replay of the lecture was possible. 

4.3 Post-Processing 

Besides developing and porting the AOF viewing software and post-editing 
the textual documents (see section 4.4), the individual modules had to be 
post-processed, in particular to edit the lectures, revise the texts and convert 
the documents to an appropriate viewing format.

⁴Any readable or executable document constrained to be dynamic would have been a
candidate for inclusion as an external application to a lecture, including Java programs.

The online lectures were processed with a custom editor (Maass (1996)),
permitting the correction of minor mistakes such as slips of the tongue,
pauses and unwanted whiteboard actions, such as scribbles, slide selections
and launching of external applications, by cutting out unwanted portions
of individual data streams. In order not to erase any actions performed
concurrently with an undesired event, such as in the case of a drawing on
the whiteboard occurring during a word repetition, whiteboard actions can
be shifted in time. As speech, being a continuous data stream, is much
more subject to errors than any other data stream, such as whiteboard
actions, which only occur at discrete points in time, the audio was taken as
a reference.
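The editing step can be pictured as removing an interval from the audio and adjusting the time stamps of the whiteboard events accordingly. The sketch below is not the editor of Maass (1996); the function name and the policy of moving events inside the cut to the cut position are assumptions chosen to illustrate how concurrent actions can be preserved.

```python
from typing import List, Sequence


def cut_interval(event_times_ms: Sequence[int],
                 cut_start_ms: int,
                 cut_end_ms: int) -> List[int]:
    """Adjust event time stamps after removing [cut_start_ms, cut_end_ms) from the audio.

    Events before the cut are unchanged, events inside the cut are kept but
    moved to the cut position, and later events are shifted forward in time.
    """
    if cut_end_ms < cut_start_ms:
        raise ValueError("cut_end_ms must not precede cut_start_ms")
    removed = cut_end_ms - cut_start_ms
    adjusted = []
    for t in event_times_ms:
        if t < cut_start_ms:
            adjusted.append(t)
        elif t < cut_end_ms:
            adjusted.append(cut_start_ms)   # preserve the action, shift it in time
        else:
            adjusted.append(t - removed)
    return adjusted


# Hypothetical example: a one-second word repetition between 5.0 s and 6.0 s
# is cut out; a drawing action at 5.2 s must survive the cut.
event_times = [1000, 5200, 9000]
edited = cut_interval(event_times, 5000, 6000)
```

With the audio as the reference stream, only the event time stamps need to be rewritten; the stored whiteboard objects themselves remain untouched.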

Revising the textual components meant not only proofreading the lecturers’ 
papers, but also adapting the layout to the publisher’s specifications. Finally, 
index terms were added to the source document to automatically generate an 
index for the printed version, an issue that only a few authors had thought 
of. These tasks required considerable editing of the source documents, which 
was not done by the authors of the original papers, as it ideally should have
been.

Several textual documents were added for didactic reasons and in order to 
ease navigation and orientation, namely 

• a short overview introducing the topic to be discussed 

• a lecture overview revealing its structure 

• an overview of the applications with reference to the chapters in which 
they occurred 

• a description of the animated algorithms 

• dead end pages⁵

In particular, since the lecturers did not provide any structuring information 
for their talks, it was necessary to guess a structure by listening to each of 
the lectures separately. 

Last, we mention the process of converting the individual modules to suitable 
media formats for digital publishing. In order to minimize the requirements 
for users, uniform document formats were used, namely PDF for texts and 
slides, GIF for images and MPEG and QuickTime for movies. It is remarkable
that even in the age of computerized typesetting systems the publisher
required a hardcopy for direct photomechanical reproduction from the editor.
This still seems to be the only way to guarantee that the printing process does
not falsify the original manuscript. For this reason, we chose PostScript as an
intermediate printable format. As the modifications to the source documents
were numerous, repeated document conversion was required. In this context,
it was advantageous to have the sources spread across several documents
so that local modifications could be made without affecting the complete
publication⁶.



⁵Two CD-ROM disks were needed, as the audio data for the lectures exceeded the
storage of a single disk. The access of a lecture was constrained to the respective overview
page. The overview pages for lectures not present on the storage medium were substituted
by a document instructing the user on the steps to be taken to view the lecture (i. e. how
to exchange the CD-ROMs).

⁶Note that this procedure is more or less applicable to all conversions from an authoring
format (if not HTML) to a destination format suitable for electronic publishing (PDF or
HTML), since most text formats do not permit the inclusion of hypertext features, notably
hyperlinks, which then have to be added in a post-processing stage, as described in the
next section.

As the XTango animations require X11 libraries for execution, we could
not include them as such for the Windows version. This problem was solved 
by converting them to QuickTime movies. Simulations, however, could not 
be transferred appropriately, as they involve user interaction. On the other 
hand, the effort to re-implement the simulations would not have been justi- 
fied. 



4.4 Document Integration 

As all the individual documents were available in their destination format, 
they were structured into a document tree, which would then be burnt onto 
CD-ROM after relating the documents to each other. This involved adding 
structural navigation aids to the textual documents in PDF format, such as 
links⁷ and bookmarks⁸. Additionally, it had to be ensured that all involved
document types, in particular AOF documents, movies and animations, were
correctly launched. For most PDF documents, especially the slides used in
the lectures, so-called thumbnails⁹ were provided. Finally, a fulltext index
was generated to make the publication searchable. 

Adding cross-document references was a tedious process, since none of the 
source documents contained suitable information to generate these refer- 
ences automatically, e. g. by embedding pdfmark (Bienz, Staas (1996)) into 
PostScript output. Additionally, a major problem was to reflect changes in 
the sources within the post-processed PDF documents, such as adjustments 
to links and bookmarks. 

For each slide change within a lecture, a file containing a timestamp (in 
milliseconds) had been generated. However, while associating the overview 
page entries with these starting points, it was necessary to adjust these 
timestamps so that upon clicking a subtopic, the lecture would start at the 
correct time when that topic was about to be discussed. In this context, any 
modifications to the AOF document due to post-editing would also result in 
the timestamps needing adjustment. 
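
The kind of adjustment described above can be sketched as follows. This is only
a minimal illustration, not the tooling actually used: the function name
`adjust_timestamps`, the lead-in offset, the global shift caused by post-editing
and the sample values are all assumptions introduced for this example.

```python
def adjust_timestamps(slide_times_ms, lead_in_ms=0, edit_shift_ms=0):
    """Return replay start points (in ms) for the overview-page entries.

    slide_times_ms: recorded slide-change timestamps in milliseconds.
    lead_in_ms:     hypothetical amount by which playback should start
                    before the slide change, so that the introduction of
                    the topic is not cut off.
    edit_shift_ms:  hypothetical global shift caused by post-editing
                    (negative if material was removed before the slide).
    """
    adjusted = []
    for t in slide_times_ms:
        start = t + edit_shift_ms - lead_in_ms
        adjusted.append(max(0, start))  # never seek before the beginning
    return adjusted


# Illustrative values only.
recorded = [0, 95_000, 312_500]
adjusted = adjust_timestamps(recorded, lead_in_ms=2_000, edit_shift_ms=-1_500)
print(adjusted)
```

Keeping the recorded values and the corrections separate means that a further
post-edit only requires changing the shift, not the recorded timestamps.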

In order to prevent the user from getting lost within the electronic
document, we focused on visualizing structural relations, namely by
using context-sensitive bookmarks to ease navigation between and within
chapters. Links were used to a much lesser extent, especially from within
the overview pages (table of contents and lecture overview). Links were
typified through the use of different colors to represent the document types



'^More precisely, links can either be intra-modular references, meaning that source and 
destination anchor belong to the same module, or extra-modular references, with source 
and destination anchor within different modules, or the destination anchor being an entire 
document 

® Bookmarks are either inter-modular references, meaning that source and destination 
anchor are equivalent to an entire module, or intra-modular references, where the source 
anchor is equivalent to the whole document 

^Thumbnails are small representations of document pages, which may appear
simultaneously with a particular document.

Figure 2: Screenshot from the electronic version of the multimedia textbook



associated with destination links (red and black for texts, blue for anima- 
tions, cyan for simulations, and magenta for video clips). In addition, we 
used the PDF Notes mechanism^^ to provide literature references in order 
to prevent the user from moving to a different page or document. Also, the 
images contained within the contributed articles were made available as GIF 
documents and linked to appropriate references to figures within the text. 
This is useful because a reference is often located on a different page than the 
picture referred to (the destination anchor), but the user may want to con- 
tinue reading on the same page while viewing the image. Thus, by clicking 
on a reference to a picture, it is displayed externally within an image viewer, 
which allows for arbitrary resizing of the image. Finally, intra-document 
references were also made available as (extra-modular) links. 

Figure 2 shows a screenshot from a session with the electronic version of the 
textbook. 

4.5 Considerations for Multi-Platform Delivery 

While the majority of computer users own or use a Windows-based PC,
most computers at scientific institutions are Unix-based. Our aim was to
produce a multimedia publication which would be usable in various contexts,
in particular as a CD-ROM product for a standalone Windows PC, but also
within local networks at academic institutions. Thus, our own software, in
particular the tools to replay the online lectures, was implemented for Solaris
(SunOS), IRIX (SGI), Linux, as well as for Windows 95/Windows NT. An
implementation in Java was discarded because the communication between

^^Notes are visualized as small icons on top of a document page, which are opened as
small windows containing arbitrary text

the various modules replaying the individual data streams would not have
been fast enough.

Our first approach was to deliver all necessary components, including
software, animations and movies, on CD-ROM. However, this would have raised
considerable problems for the following reasons:

• most software (shareware or freeware) is only licensed for personal use, 
and not for commercial distribution; 

• updates to the software would have required sending a new CD containing
the most current version to all customers;

• the supplied software would have had to be included for all
supported operating systems, thus considerably increasing the required
storage space.

Therefore, all platform-independent parts, i. e. all data, were separated 
from the platform-dependent modules, i. e. the software. More precisely, 
all textual files (including the fulltext index) and the video clips are found 
on two CD-ROMs, while the software, all simulations and animations and 
miscellaneous scripts to activate the data are found in a platform-dependent 
ftp archive. While the former is not likely to change, except if content-related 
errors are found, the software is being updated and improved regularly. As 
an additional benefit, a direct hotline to the developers (via an e-mail address 
and a form-based Web interface) is offered to the customer. The resulting 
advantages are noted: 

• it is possible to retrieve and use the latest software versions from the 
archives; 

• risks are reduced for the publishing house, since it is not liable for 
modules which are not distributed with the publication; 

• software bugs can quickly be eliminated and any other modifications 
or improvements result in a new version being offered; 

• other platforms may be supported in the future without affecting the 
CD-ROM shipped with the publication. 

Figure 3 illustrates the installation scenario, as encountered by the user. 
Although the required installation steps may seem unusual, most of the 
users are likely to be computer science students who usually have easy ac- 
cess to Internet services via their respective institutions. Thus, interested 
parties are assumed to be able to retrieve the necessary archive and to be 
knowledgeable enough to perform the necessary tasks required to install the 
electronic version of the publication on their local computer. 

With a few simple modifications, it is also possible to configure the publica- 
tion to be executable in local networks. In this case, the required software 
as well as the data for the online lectures resides in shared directories, while 
all textual documents and the destination anchors of the references to ex- 
ternal applications are stored within a Web server. Thus, the transfer of the 
non-static media is executed via the local network protocol, e. g. NFS, while 
the static media are transferred via the hypertext transfer protocol.

Figure 3: Using the electronic version of the multimedia textbook

5 Conclusion and Future Work 



We have shown in this paper that a carefully developed multimedia publica- 
tion is more than just a combination of several media types. In particular, 
we see the role of new media in education as a complement to conventional 
educational scenarios, which are based on text and audio. These seem to 
be crucial for knowledge dissemination, and therefore any other media types 
are considered to be an optional supplement. On the other hand, consid- 
erations for electronic publishing have to be seen in the context of open 
environments: even if most end users are assumed to work on a PC, this is 
not the case at universities, where different platforms and operating systems 
are found. Another issue is the re-usability of existing components and
publicly available documents and applications in heterogeneous formats. Thus,
the effort required to integrate individual components should not be
overlooked. In this case, all the tasks necessary to post-process and integrate
the individual modules can be assumed to have taken a man-year of work,
resulting in the equivalent of 250 pages of scientific articles, 13 hours of
recorded lectures, 23 animations, 2 simulations and 10 video clips. The
effort to prepare the lectures, write the individual papers and produce the
external applications is not included in the time estimated to produce this
publication. The considerable effort required to achieve the final product
clearly shows that powerful and easy-to-use authoring tools for the
development of high-quality courseware to support university teaching are still
not available for use in open environments. Therefore, we are investigating the
possibilities of multimedia authoring with the goal of embedding all information
relevant to the publication into the source documents, in order to avoid te- 
dious post-processing. Another issue is the possibility of active manipulation 
and customization of existing courseware, e. g. by providing suitable anno- 
tation mechanisms for generic document formats. These efforts are geared 
towards the development of widely acceptable tools capable of supporting 
authors, lecturers and students.

Abstract: This paper describes how clustering problems can be resolved by
neural network (NN) approaches such as Hopfield nets, multi-layer perceptrons,
and Kohonen's 'self-organizing maps' (SOMs). We emphasize the close relationship
between the NN approach and classical clustering methods. In particular, we
show how SOMs are derived by stochastic approximation from a new continuous
version (K-criterion) of a finite-sample clustering criterion proposed by Anouar
et al. (1997). In this framework we determine the asymptotic behaviour of
Kohonen's method, design a new finite-sample version of the SOM approach of the
k-means type, and propose various generalizations along the lines of classical
'regression clustering', 'principal component clustering', and
'maximum-likelihood clustering'.

1 Introduction 

Artificial neural networks (NNs) are used in pattern recognition for
constructing discrimination rules which assign objects (data points) to one of
several known and more or less specified classes (supervised learning). In
contrast, the present paper deals with NN approaches for the clustering problem
(unsupervised learning) where we have a set $O = \{1, \ldots, n\}$ with objects
$k = 1, 2, \ldots, n$ described by multidimensional data points
$x_k \in \mathbb{R}^p$ and we look for an 'adequate' classification of these
data, either in the form of a partition $C = (C_1, \ldots, C_m)$ of $O$ into
'homogeneous' classes $C_1, \ldots, C_m$ with suitable class representatives
$z_1, \ldots, z_m \in \mathbb{R}^p$, or in the form of a subdivision
$B = (B_1, \ldots, B_m)$ of the whole feature space $\mathbb{R}^p$ into $m$
domains $B_i \subset \mathbb{R}^p$ which may correspond to $m$ underlying
populations (types) of objects.

Clustering problems are often formalized by specifying a clustering criterion
$g_n(C,Z)$ or $g(B,Z)$ which is optimized with respect to the partitions
$C$ or $B$ and all center systems $Z = (z_1, \ldots, z_m)$ (see Bock 1974,
1996a,b,c, Späth 1985). Classical cluster analysis solves such optimization
problems by relocation algorithms (such as k-means), branch-and-bound methods or
numerical iteration. If data points are sequentially observed the classical
stochastic approximation approach provides an additional tool (Bock 1974,
chap. 29). In fact, this sequential adaptation idea is commonly practised by
the neural network community when designing 'learning algorithms' which
proceed typically by a sequential updating of states or parameters in a way
which mimics the dynamic behaviour of mechanical, gravitational, or biological
systems when approaching a stationary state with minimum 'energy'.
In this paper we describe some NN approaches for clustering and propose
some new criteria and methods in the framework of Kohonen maps. In section 2
Hopfield networks are used for fuzzy classification. Section 3 describes
a clustering model where a multi-layer perceptron has been used. By recalling
some facts from classical cluster analysis and stochastic approximation,
section 4 sets the stage for the investigation of Kohonen maps in section
5 and the formulation of a k-means type alternative. Finally, in section
6, we propose some new (sequential and k-means type) generalizations of
Kohonen's approach in analogy to classical 'regression clustering', 'principal
component clustering', and 'maximum-likelihood clustering'.



2 Hopfield Networks for Clustering Purposes 

Hopfield networks are designed to minimize a smooth loss function
$L(v_1, \ldots, v_N) > 0$ with respect to $N$ real-valued variables
$v_1, \ldots, v_N$ (each one represented by a 'neuron' or memory unit). The
minimization algorithm proceeds in analogy to the motion of $N$ particles which
migrate in time $t > 0$ under a system of forces (Hamiltonian differential
equations) and approach a steady state $(v_1^*, \ldots, v_N^*)$ which provides a
(global or local) minimum of the 'energy function' $L$.

More specifically, each variable $v_i$ is modeled as a time function
$v_i(t) = \sigma(a_i(t))$ (position of the particle $i$, output of the 'neuron'
$i$) with an activation function $a_i(t)$ and a prespecified smooth sigmoid
function $\sigma$, e.g., the logistic function $\sigma(a) = 1/(1 + e^{-a})$.
The activations $a_i(t)$ evolve in time according to the system of differential
equations:

$$ \dot{a}_i(t) = -\frac{\partial}{\partial v_i} L(v_1(t), \ldots, v_N(t)), \qquad i = 1, \ldots, N, \quad t > 0. \qquad (2.1) $$

It appears that $L$ is a Lyapunov function for the system (2.1), which means
that the loss function $K(t) := L(v_1(t), \ldots, v_N(t))$ decreases in $t$
along any solution of (2.1). This follows from
$\dot{K}(t) = \sum_{i=1}^{N} L_{v_i}(v_1(t), \ldots, v_N(t)) \cdot \dot{v}_i(t) \le 0$,
since $\dot{v}_i(t) = \sigma'(a_i(t)) \, \dot{a}_i(t) = -\sigma'(a_i(t)) \, L_{v_i}$
by (2.1) and $\sigma' > 0$ for a sigmoid function. Therefore the limits
$a_i^* := \lim_{t \to \infty} a_i(t)$ exist and the corresponding 'outputs'
$v_i^* := \sigma(a_i^*)$ yield a (local or global) minimum
$L(v_1^*, \ldots, v_N^*)$ of $L$. In practice, the motion $a_i(t)$ of the
particles is determined either analytically, by numerical methods for
differential equations, or by simulation in continuous or discrete time in order
to find the asymptotic values $v_i^*$. (Note that an interpretation by a 'neural
network' is not really needed, but useful for illustration.)
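
A discrete-time simulation of (2.1) can be sketched as follows. This is a
minimal illustration under stated assumptions, not a method from the cited
literature: the explicit Euler scheme, the step size, the function names and
the toy quadratic loss are all choices made only for this example.

```python
import math


def sigmoid(a):
    """Logistic function sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))


def hopfield_flow(grad_loss, a0, step=0.05, n_steps=2000):
    """Explicit Euler discretization of da_i/dt = -dL/dv_i, v_i = sigma(a_i).

    grad_loss: function mapping the output vector v to the gradient of L at v.
    a0:        initial activations.
    Returns the final outputs v_i = sigma(a_i).
    """
    a = list(a0)
    for _ in range(n_steps):
        v = [sigmoid(ai) for ai in a]
        g = grad_loss(v)
        a = [ai - step * gi for ai, gi in zip(a, g)]
    return [sigmoid(ai) for ai in a]


# Toy loss (an assumption for illustration): L(v) = sum_i (v_i - target_i)^2.
TARGET = [0.2, 0.7]


def toy_loss(v):
    return sum((vi - ti) ** 2 for vi, ti in zip(v, TARGET))


def toy_grad(v):
    return [2.0 * (vi - ti) for vi, ti in zip(v, TARGET)]


v_star = hopfield_flow(toy_grad, a0=[0.0, 0.0])
print(v_star, toy_loss(v_star))
```

For this toy loss the outputs approach the target values, and the loss at the
final state is smaller than at the starting point $v = (0.5, 0.5)$.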

This optimization method can be applied to a clustering problem whenever
this problem can be formulated as a minimization problem $L \to \min$,
say, for a smooth clustering criterion $L(v_1, \ldots, v_N)$ with real-valued
variables $v_1, \ldots, v_N$ which describe the unknown classification (e.g.,
class centers, fuzzy memberships etc.). Then we have to set up the differential
equations (2.1) for the $a_i(t)$, resolve this system in order to determine the
steady states $a_i^*$ and $v_i^*$, and to re-interpret them in terms of an
optimum classification of data.

Two special cases may illustrate the approach: We look for a fuzzy
classification $U = (u_{ik})_{m \times n}$ of $n$ objects where
$u_{ik} \in [0, 1]$ is the degree of membership of the $k$-th object in the
$i$-th fuzzy class $U_i$, which leaves $N = m \cdot n$ variables
$u_{ik} = v_{ik} = \sigma(a_{ik})$ to be determined. A classical fuzzy
clustering approach (Bock 1979, Bezdek 1981) would suggest minimizing a modified
SSQ criterion:



TTl Tl Tft Tt 

9i{U) := E E (^) = E E '\\xk — xui\^ min (2.2) 

2=1 k=l 2=1 k=l ^ 

with class centers xu^ := [E^r=i deviations Aik{U) ~ 
W^k - if/JP under the side constraints '^ik = 1 for all k, Kamgar-Parsi 
et al. (1990) replace these constraints by weighted penalty terms: 

• 92 {^) = Ylk=i[^i^i ^2fc — 1]^ for forcing the norming conditions Uik = 

1 , 

• 9s{^) •— Y^k=i[Yl^i UikUjk] for reducing the overlap between classes, 

• 9 a{U) := ^k=i S£i for forcing the Uik to be close to 0 or 

1 . 

and define the clustering criterion: 

L{U) := I • 9i{U) + I • 92{1^) + 2 ’ ^s(^) + ^4(^) iinn (2.3) 

with prespecified weights $\alpha, \beta, \gamma > 0$. The corresponding
differential equations:



m m 

aik{t) = -aik{t) - I • Vik{t) ■ 5ik{t) - f • [E “ 2 ■ E 

2=1 j^i 

are solved (in a slightly modified form) by analog or discrete-time simulation.
For $t \to \infty$, a stationary (sub-)optimum classification
$U^* = (u_{ik}^*)$ is obtained.

Adorf & Murtagh (1988) consider an $n \times n$ matrix $D = (d_{kl})$ of
pairwise dissimilarities $d_{kl}$ for $n$ objects and look for a fuzzy
$m$-partition $U$ which minimizes:

$$ L(U) := \frac{\alpha}{2} \sum_{i=1}^{m} \sum_{k=1}^{n} \sum_{l=1}^{n} d_{kl} u_{ik} u_{il} \;-\; \frac{\beta}{2} \sum_{i=1}^{m} \sum_{j \neq i} \sum_{k=1}^{n} \sum_{l=1}^{n} d_{kl} u_{ik} u_{jl} \;+\; \frac{\gamma}{2} \sum_{i=1}^{m} \sum_{j \neq i} \sum_{k=1}^{n} b_k u_{ik} u_{jk} $$

$$ = -\frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \sum_{k=1}^{n} \sum_{l=1}^{n} w_{ik,jl} \cdot u_{ik} u_{jl} \;\to\; \min_{U}. \qquad (2.5) $$

This criterion is a modification of the SSQ clustering criterion (4.1): The
first term should guarantee a small heterogeneity in the classes, the second
one a high separation among the classes, and the third term little overlap
for all objects $k$. The positive weights $\alpha, \beta, \gamma, b_k$ are
prespecified (typically $b_k = 1$). $L(U)$ is a quadratic function in the
$u_{ik}$ with coefficients
$w_{ik,jl} = -\alpha \cdot d_{kl} \delta_{ij} + \beta \cdot d_{kl} (1 - \delta_{ij}) - \gamma b_k \cdot (1 - \delta_{ij}) \delta_{kl}$
(and $\delta_{ij} \in \{0, 1\}$ the Kronecker symbol) which yields a linear
system of differential equations:

$$ \dot{a}_{ik}(t) = \sum_{j=1}^{m} \sum_{l=1}^{n} w_{ik,jl} \, u_{jl}(t), \qquad t > 0, \quad i = 1, \ldots, m, \quad k = 1, \ldots, n. \qquad (2.6) $$

As before, the limiting values $a_{ik}^*$ for $t \to \infty$ provide a
stationary or optimal fuzzy classification
$U^* = (u_{ik}^*) = (\sigma(a_{ik}^*))$.

Note that the Hopfield clustering approach resembles very much the classical 
gravitational clustering methods proposed by Butler (1969), Coleman (1971) 
and Wright (1977) (see also Bock 1998). 



3 Multilayer Perceptrons for Additive Clustering

A Multilayer Perceptron (MLP) is designed to solve optimization problems,
e.g., to minimize an (often: quadratic) goodness-of-fit criterion. An
MLP is constructed by concatenating several layers of elementary processing
units $i$ ('neurons') where each one realizes a nonlinear function of the type
$v = \sigma(\sum_j w_j u_j)$ of 'input' values $u_1, u_2, \ldots$ with a sigmoid
function $\sigma$ and coefficients $w_j$ ('weights'). Thus an MLP can be used
for solving a clustering problem which is formulated as an optimization problem.

A typical example is provided by Sato & Sato (1995, 1998) who start from
an $n \times n$ matrix $S = (s_{kl})_{n \times n}$ of pairwise similarities
$s_{kl} > 0$ and look for an optimal fuzzy classification
$U = (u_{ik})_{m \times n}$ of the underlying $n$ objects with $m$ classes. By
generalizing the classical additive clustering model (ADCLUS) of
Shepard & Arabie (1979), they assume that each pair of fuzzy classes $U_i, U_j$
has an unknown underlying similarity $w_{ij}$ and that the observed object by
object similarities $s_{kl}$ can be modeled, up to random errors, by an additive
superposition of the form
$\sigma_{kl} = \sum_{i=1}^{m} \sum_{j=1}^{m} w_{ij} \, \rho(u_{ik}, u_{jl})$.
Here $\rho(u_{ik}, u_{jl})$ is a suitable aggregation function which weights the
simultaneous appearance of the events '$k \in U_i$ with a degree $u_{ik}$' and
'$l \in U_j$ with a degree $u_{jl}$' (typically, $\rho(u,v) = u \cdot v$,
$\min\{u,v\}$, $uv/[1 - (1-u)(1-v)]$ etc.). The model $\sigma_{kl}$ is
adapted to the data $s_{kl}$ by minimizing the clustering criterion:

$$ G(U,W) := \sum_{k} \sum_{l} (s_{kl} - \sigma_{kl}(U,W))^2 \;\to\; \min \qquad (3.1) $$

with respect to all fuzzy partitions $U$. Sato & Sato solve this problem with
the help of an MLP.

4 k-means Clustering and Stochastic Approximation

Classical k-means clustering is our starting point for describing a stochastic
approximation approach and for applying both (!) to several extensions of
Kohonen's SOMs in sections 5 and 6.

We want to partition $n$ data points $x_1, \ldots, x_n \in \mathbb{R}^p$ (or the
underlying set $O = \{1, \ldots, n\}$ of objects) into a given number $m$ of
'homogeneous' classes $C_1, \ldots, C_m$ by minimizing the well-known variance
or SSQ clustering criterion

$$ g_n(C) := \frac{1}{n} \sum_{i=1}^{m} \sum_{k \in C_i} \|x_k - \bar{x}_{C_i}\|^2 \;\to\; \min_{C} \qquad (4.1) $$

with respect to all $m$-partitions $C = (C_1, \ldots, C_m)$. An equivalent
problem is:

$$ g_n(C,Z) := \frac{1}{n} \sum_{i=1}^{m} \sum_{k \in C_i} \|x_k - z_i\|^2 \;\to\; \min_{C,Z} \;=:\; g_n^* \qquad (4.2) $$

where minimization is also over all $m$-tuples $Z = (z_1, \ldots, z_m)$ of
'class centers' $z_1, \ldots, z_m \in \mathbb{R}^p$. It is well-known that any
optimum pair $C^*, Z^*$ is related by the stationarity equations
$z_i^* = \bar{x}_{C_i^*}$ (centroids) and
$C_i^* = \{ k \in O \mid \|x_k - z_i^*\| = \min_j \|x_k - z_j^*\| \}$
(minimum-distance partition; see Bock 1974).

In the following it will be convenient to consider a continuous counterpart
of the variance criterion (4.2):

$$ g(B,Z) := \sum_{i=1}^{m} \int_{B_i} \|x - z_i\|^2 \cdot f(x) \, dx \;\to\; \min_{B,Z} \;=:\; g^* \qquad (4.3) $$

where minimization is over all $m$-partitions $B = (B_1, \ldots, B_m)$ of
$\mathbb{R}^p$ and all center systems $Z = (z_1, \ldots, z_m)$. Here $f(x)$ is a
fixed distribution density (from which the data $x_1, x_2, \ldots$ could be
sampled). Any optimum configuration $B^*, Z^*$ will be related by similar
stationarity conditions as before: $z_i^* = E_f[X \mid X \in B_i^*]$ is the
conditional expectation of $X$ in the domain $B_i^*$ (under $f$), and
$B_i^* = \{ x \in \mathbb{R}^p \mid \|x - z_i^*\| = \min_j \|x - z_j^*\| \}$
is the Voronoi region in $\mathbb{R}^p$ of the centroid $z_i^*$.

There are essentially two types of iterative clustering algorithms for solving
(approximately) the optimization problems (4.2) or (4.3):

(1.a) The k-means algorithm:

Considering (4.2) for a fixed data set $\{x_1, \ldots, x_n\}$, we begin with an
arbitrary initial $m$-partition $C^{(0)}$ and minimize iteratively the criterion
$g_n(C,Z)$ with respect to $Z$ and $C$ in turn. This results

• in a sequence $(Z^{(t)})_{t=0,1,\ldots}$ of center systems $Z^{(t)}$ with class
representatives $z_i^{(t)} = \bar{x}_{C_i^{(t)}}$ (centroids of the classes
$C_i^{(t)}$), and

• in a sequence $(C^{(t+1)})$ of $m$-partitions where $C^{(t+1)}$ is the
minimum-distance partition generated by the center system $Z^{(t)}$.

Both sequences are steadily improving the criteria (4.1) and (4.2) and end
with a stationary value of $g_n$ which is (hopefully) close or identical to the
global minimum $g_n^*$.

Similarly, the k-means approach for the continuous criterion (4.3) determines,
for $t = 0, 1, \ldots$, the center system $Z^{(t)}$ by
$z_i^{(t)} := E_f[X \mid X \in B_i^{(t)}]$ and a new classification $B^{(t+1)}$
as the minimum-distance partition generated by $Z^{(t)}$.
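
The alternating minimization of (1.a) for the finite-sample criterion (4.2) can
be sketched as follows. This is a minimal illustration in plain Python; the
function names, the handling of empty classes and the toy data are assumptions
made only for this example.

```python
def squared_distance(x, z):
    return sum((xi - zi) ** 2 for xi, zi in zip(x, z))


def centroid(points):
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


def k_means(data, initial_labels, m, max_iter=100):
    """Alternate between centroid computation and minimum-distance assignment.

    data:           list of points (tuples).
    initial_labels: initial m-partition, given as a class index per point.
    Returns (labels, centers, value of the criterion g_n).
    Assumption of this sketch: an empty class keeps its previous center
    (or the first data point if it was empty from the start).
    """
    labels = list(initial_labels)
    centers = [data[0]] * m
    for _ in range(max_iter):
        # Minimize with respect to Z: class centroids.
        for i in range(m):
            members = [x for x, lab in zip(data, labels) if lab == i]
            if members:
                centers[i] = centroid(members)
        # Minimize with respect to C: minimum-distance partition.
        new_labels = [
            min(range(m), key=lambda i: squared_distance(x, centers[i]))
            for x in data
        ]
        if new_labels == labels:
            break  # stationary partition reached
        labels = new_labels
    g_n = sum(
        squared_distance(x, centers[lab]) for x, lab in zip(data, labels)
    ) / len(data)
    return labels, centers, g_n


# Toy data (illustrative only): two well-separated groups on the real line.
data = [(0.0,), (1.0,), (10.0,), (11.0,)]
labels, centers, g_n = k_means(data, initial_labels=[0, 1, 0, 1], m=2)
print(labels, centers, g_n)
```

Starting from a deliberately poor initial partition, the two groups are
separated after one reassignment step and the partition becomes stationary.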



(1.b) The sequential stochastic approximation approach for (4.3):

Here the data points $x_1, x_2, \ldots$ are sequentially observed and after
presentation of $x_{n+1}$ the center system $Z^{(n)}$ and the classification
$C^{(n)}$ or $B^{(n)}$ obtained from $\{x_1, \ldots, x_n\}$ are suitably
updated. This learning algorithm paradigm is emphasized in the NN context and
includes the classical method of MacQueen (1967) (yielding partitions $C^{(n)}$
of $\{x_1, \ldots, x_n\}$; see Bock 1974, 1998), and the stochastic
approximation method by Braverman (1966) and Tsypkin & Kelmans (1967) which
concentrates on the sequence $(Z^{(n)})$ of class representatives (which provide
the minimum-distance partitions $B^{(n)}$ of $\mathbb{R}^p$).

This stochastic approximation algorithm begins with $m$ centers
$z_i^{(m)} := x_i$ (for $i = 1, \ldots, m$) and updates, after observing
$x_{n+1}$ with $n \ge m$, the current system
$Z^{(n)} = (z_1^{(n)}, \ldots, z_m^{(n)})$ of class centers (typically not
centroids) according to the recursive formula

$$ z_i^{(n+1)} = \begin{cases} z_i^{(n)} + \alpha_{in} \cdot (x_{n+1} - z_i^{(n)}) & \text{for } i = i^* \\ z_i^{(n)} & \text{for } i \neq i^*, \end{cases} \qquad (4.4) $$

where $i^*$ is the index of the center $z_i^{(n)}$ with minimum distance from
$x_{n+1}$, i.e.:
$\|x_{n+1} - z_{i^*}^{(n)}\| = \min_{1 \le j \le m} \|x_{n+1} - z_j^{(n)}\|$
(competitive learning). The predetermined (possibly class-specific) 'learning
factors' $\alpha_{in}$ ('inverse temperature') with
$\lim_{n \to \infty} \alpha_{in} = 0$ fulfill typically the conditions
$\sum_n \alpha_{in} = \infty$ and $\sum_n \alpha_{in}^2 < \infty$ (such as
$\alpha_{in} = 1/n$) in order to guarantee convergence.
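
The recursion (4.4) can be sketched as follows. This is a minimal illustration
under stated assumptions: the function name is hypothetical, a single learning
sequence $\alpha_n = 1/n$ is shared by all classes, and the toy stream is
chosen only for this example.

```python
def sequential_centers(stream, m):
    """Competitive-learning update (4.4) with learning factors alpha_n = 1/n.

    stream: iterable of points (tuples); the first m points are taken as
            the initial centers, as in the text.
    Assumption of this sketch: the same learning sequence is used for
    every class.
    """
    centers = []
    n = 0  # number of points observed so far
    for x in stream:
        if n < m:
            centers.append(list(x))
        else:
            # Index i* of the center closest to the new point.
            i_star = min(
                range(m),
                key=lambda i: sum(
                    (xd - zd) ** 2 for xd, zd in zip(x, centers[i])
                ),
            )
            alpha = 1.0 / n
            # Only the winning center is moved towards the new point.
            centers[i_star] = [
                zd + alpha * (xd - zd) for xd, zd in zip(x, centers[i_star])
            ]
        n += 1
    return [tuple(z) for z in centers]


# Toy stream (illustrative only).
result = sequential_centers([(0.0,), (10.0,), (1.0,), (11.0,)], m=2)
print(result)
```

In contrast to the k-means iteration, each point is used once and only the
center closest to it is changed.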

It appears that for an increasing number $n$ of data points both approaches
(1.a) and (1.b) are basically resolving the same continuous optimization
problem (4.3). In fact, if the data points $x_1, x_2, \ldots$ are independently
sampled with the same sufficiently regular density $f(x)$ on $\mathbb{R}^p$, we
have the following result:

Theorem 1:

(a) If the optimum $m$-partition $B^*$ of the continuous criterion (4.3) is
unique (up to a permutation of classes), then the optimum configurations
$(Z^{*(n)}, C^{*(n)})$ for the classical SSQ criterion (4.2) and the minimum
criterion values $g_n^* = g_n(C^{*(n)}, Z^{*(n)})$ converge, for
$n \to \infty$, to their continuous counterparts (possibly after permuting the
class indices):

$$ z_i^{*(n)} \;\to\; z_i^* = E[X \mid X \in B_i^*] \qquad (4.5) $$

$$ g_n^* := \min_{C,Z} g_n(C,Z) \;\to\; g^* := \min_{B,Z} g(B,Z) = g(B^*,Z^*) \qquad (4.6) $$

in probability and almost surely (Pollard 1981; see also Bock 1985, 1996a).

(b) Analogous limits are obtained for the class centers $z_i^{(n)}$ which result
from the sequential approximation approach (4.4) and for the corresponding
minimum-distance partitions $C^{(n)}$ of $\{x_1, \ldots, x_n\}$:

$$ z_i^{(n)} \;\to\; z_i^* = E[X \mid X \in B_i^*] \qquad (4.7) $$

$$ g_n := g_n(C^{(n)}, Z^{(n)}) \;\to\; g^* := g(B^*,Z^*) \qquad (4.8) $$

in probability and almost surely, where $(B^*, Z^*)$ is a stationary pair (local
minimum) for $g(B,Z)$. For details see Braverman (1966), Tsypkin & Kelmans
(1967) and Bock (1974, Chap. 29).

The previous results were derived by Braverman in a much more general
setting by considering a generalized partitioning criterion of the form:

$$ G(B,Z) := \sum_{i=1}^{m} \int_{B_i} \phi_i(x; z_1, \ldots, z_m) \cdot f(x) \, dx \;\to\; \min_{B,Z} \qquad (4.9) $$

where the squared Euclidean distance $\|x - z_i\|^2$ in (4.3) has been replaced
by a general class-specific distance function
$\phi_i(x; Z) = \phi_i(x; z_1, \ldots, z_m)$. In analogy to (4.2), a
finite-sample version is given by:

$$ G_n(C,Z) := \frac{1}{n} \sum_{i=1}^{m} \sum_{k \in C_i} \phi_i(x_k; z_1, \ldots, z_m) \;\to\; \min_{C,Z}. \qquad (4.10) $$

Both criteria will be met in the context of a generalized Kohonen method
in the next sections. As before, we may design two types of minimizing
strategies:

(2.a) A generalized k-means approach:

Considering (4.10) first, we determine iteratively a sequence of steadily
improving 'center systems' $Z^{(t)}$ and $m$-partitions $C^{(t)}$ of
$O = \{1, \ldots, n\}$ as follows ($t = 0, 1, 2, \ldots$): $Z^{(t)}$ results
from minimizing (4.10) with respect to $Z$ (for fixed $C = C^{(t)}$) by solving
the normal equations for (4.10):

$$ \sum_{\nu=1}^{m} \sum_{k \in C_\nu^{(t)}} \nabla_{z_i} \phi_\nu(x_k; z_1, \ldots, z_m) = 0, \qquad i = 1, \ldots, m. \qquad (4.11) $$

Then $C^{(t+1)}$ is defined to be the minimum-distance partition of $O$
generated by $Z^{(t)}$ with classes
$C_i^{(t+1)} := \{ k \in O \mid \phi_i(x_k; z_1, \ldots, z_m) = \min_{\nu} \phi_\nu(x_k; z_1, \ldots, z_m) \}$
for $i = 1, \ldots, m$. In the case of the continuous criterion (4.9) we
proceed similarly by solving, with the $t$-th partition $B^{(t)}$ of
$\mathbb{R}^p$, the normal equations:

$$ \sum_{\nu=1}^{m} \int_{B_\nu^{(t)}} \nabla_{z_i} \phi_\nu(x; z_1, \ldots, z_m) \, f(x) \, dx = 0 \qquad (4.12) $$

for $z_1, \ldots, z_m$ and use the resulting center system $Z^{(t)}$ to obtain
the minimum-distance partition $B^{(t+1)}$ of $\mathbb{R}^p$.

(2.b) The generalized stochastic approximation algorithm for (4.9):

After observing a new data point $x_{n+1}$ the current center system $Z^{(n)}$
(obtained from $x_1, \ldots, x_n$) is updated as follows:

$$ z_i^{(n+1)} = z_i^{(n)} - \alpha_{in} \cdot \nabla_{z_i} \phi_{i^*}(x_{n+1}; z_1^{(n)}, \ldots, z_m^{(n)}), \qquad i = 1, \ldots, m, \qquad (4.13) $$

with $m$ class-specific learning sequences $(\alpha_{in})_{n \in N}$, where, in
analogy to (4.4), $i^*$ is the index of the class with minimum dissimilarity
$\phi_i(x_{n+1}; Z^{(n)})$. Similarly as in Theorem 1 we obtain (Tsypkin &
Kelmans 1967):

Theorem 2:

If $x_1, x_2, \ldots$ are independently sampled from a suitably regular density
$f$ and if some non-singularity conditions are fulfilled (for details see, e.g.,
Bock 1974), the sequence of center systems $Z^{(n)}$, (4.13), completed by the
corresponding minimum-distance partitions $B^{(n)}$ (using the class-specific
distances $\phi_i(x; Z^{(n)})$), converges in probability and almost surely to a
stationary pair $(B^*, Z^*)$ of the criterion $G(B,Z)$.

Moreover, the criterion values $G^{(n)} := G(B^{(n)}, Z^{(n)})$ converge to
$G^* := G(B^*, Z^*)$.



5 Generalized Kohonen Maps: The K-criterion

The previous clustering approach is more or less explicitly used when
constructing 'self-organizing maps' (SOMs) in the NN framework (Kohonen
1982, 1997). Similarly as in principal component analysis (PCA), the problem
is to visualize a set of high-dimensional data points
$x_1, x_2, \ldots \in \mathbb{R}^p$. PCA projects the data onto a
low-dimensional hyperplane of dimension two, say. In contrast, Kohonen combines
basically:

(1) a sequential clustering strategy which produces a large number
$m = a \cdot b$ of 'mini-classes' $C_1, \ldots, C_m$ or class centers
$z_1, \ldots, z_m \in \mathbb{R}^p$ (typically termed 'weight vectors'), with

(2) a visualization (thematic map) where the class centers are represented
by the (possibly relabeled) vertices $P_1, \ldots, P_m$ (termed 'neurons') of a
two-dimensional rectangular lattice $\mathcal{L}$ of size $a \times b$ such that
neighbouring class centers $z_i, z_j$ in $\mathbb{R}^p$ will be mapped onto
vertices $P_i, P_j$ which are also close in $\mathcal{L}$ in terms of the path
distance $\delta(P_i, P_j)$ in $\mathcal{L}$, thereby hoping that the
'topological' structure of the data points can be sufficiently reproduced.

Kohonen's approach was more or less heuristically motivated and used a
generalized updating formula for the centers in analogy to the sequential
approximation formula (4.4):

$$ z_j^{(n+1)} = z_j^{(n)} + \alpha_n \cdot K(\delta(P_{i^*}, P_j)) \cdot (x_{n+1} - z_j^{(n)}), \qquad j = 1, \ldots, m. \qquad (5.1) $$

Here $i^*$ is the class index with minimum Euclidean distance
$\|x_{n+1} - z_{i^*}^{(n)}\|$ and $K(\delta)$ is a weighting function which
controls the set of neighbours $P_j$ of $P_{i^*}$ whose corresponding class
centers $z_j^{(n)}$ are actually updated. Typically, a binary threshold function
was used with $K(\delta) = 1$ for $\delta < \epsilon$ and $K(\delta) = 0$ else.
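
One step of the updating formula (5.1) can be sketched as follows. This is a
minimal illustration, not Kohonen's original implementation: the function
names are hypothetical, the path distance is taken to be the number of
horizontal and vertical lattice steps (an assumption about the lattice edges),
and the toy values are chosen only for this example.

```python
def lattice_path_distance(p, q):
    """Path distance between vertices p = (row, col) and q, assuming that
    lattice edges join horizontal and vertical neighbours."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def threshold_weight(delta, eps):
    """Binary threshold function: K(delta) = 1 for delta < eps, 0 else."""
    return 1.0 if delta < eps else 0.0


def kohonen_update(centers, vertices, x, alpha, weight):
    """Apply one step of (5.1) in place and return the winning index i*.

    centers:  list of class centers z_j (lists of floats).
    vertices: list of lattice vertices P_j = (row, col), same order.
    x:        newly observed data point.
    alpha:    learning factor for this step.
    weight:   function K of the lattice path distance.
    """
    i_star = min(
        range(len(centers)),
        key=lambda i: sum((xd - zd) ** 2 for xd, zd in zip(x, centers[i])),
    )
    for j, z in enumerate(centers):
        k = weight(lattice_path_distance(vertices[i_star], vertices[j]))
        centers[j] = [zd + alpha * k * (xd - zd) for xd, zd in zip(x, z)]
    return i_star


# Toy example (illustrative only): a 1 x 3 lattice in one dimension.
vertices = [(0, 0), (0, 1), (0, 2)]
centers = [[0.0], [5.0], [10.0]]
winner = kohonen_update(
    centers, vertices, x=(1.0,), alpha=0.5,
    weight=lambda d: threshold_weight(d, eps=2),
)
print(winner, centers)
```

With this threshold the winning center and its direct lattice neighbour are
moved towards the new point, while the center at path distance 2 is unchanged.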

It is evident from the previous problem description that if we could find a
suitable clustering criterion of the type (4.9) or (4.10) from which Kohonen's
sequential method derives, we could not only design a finite-sample approach
for the considered visualization problem (via k-means), but also characterize
the asymptotic performance of Kohonen's algorithm via Theorem 2 and
generalize the approach to other, geometrically or statistically motivated
data models (see section 6).

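Such a criterion has been found by Anouar, Badran & Thiria (1997) for the
finite-sample case:

$$ g_n(C,Z) := \frac{1}{n} \sum_{i=1}^{m} \sum_{k \in C_i} \Big[ \sum_{j=1}^{m} K(\delta(P_i, P_j)) \cdot \|x_k - z_j\|^2 \Big] \;\to\; \min_{C,Z}. \qquad (5.2) $$

The weight function $K(\delta)$ is typically decreasing from 1 to 0, and the
square bracket:

$$ \phi_i(x; Z) := \sum_{j=1}^{m} K(\delta(P_i, P_j)) \cdot \|x - z_j\|^2 \qquad (5.3) $$

defines a dissimilarity between a data point $x \in \mathbb{R}^p$ and the vertex
$P_i$ of $\mathcal{L}$ (representing the mini-class $C_i$) as a weighted average
of all $m$ squared distances $\|x - z_j\|^2$ in which the term for $j = i$ has
maximum weight $K(0) = 1$. As a continuous counterpart we define the following
K-criterion:

$$ g(B,Z) := \sum_{i=1}^{m} \int_{B_i} \Big[ \sum_{j=1}^{m} K(\delta(P_i, P_j)) \cdot \|x - z_j\|^2 \Big] f(x) \, dx \;\to\; \min_{B,Z} \qquad (5.4) $$

which is a special case of the continuous criterion (4.9). In the following
we will write $K_{ij} := K(\delta(P_i, P_j))$.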

Given that we are now in the classical framework of section 4, we can derive
the corresponding 'self-organizing' clustering or mapping algorithms as
follows:

(3.a) A finite-sample k-means analogue of Kohonen's method:

For a fixed number $n$ of data points, we minimize $g_n(C,Z)$, (5.2), partially
with respect to the center system $Z$ and the $m$-partition $C$ in turn, for
$t = 0, 1, \ldots$:

(1) Thus, from a current $m$-partition $C^{(t)}$ of $O$, we obtain the optimum
centers:

$$ z_i^{(t)} = \sum_{j=1}^{m} w_{ij}^{(t)} \, \bar{x}_{C_j^{(t)}}, \qquad i = 1, \ldots, m \qquad (5.5) $$

as a weighted mean of class centroids where the normalized weights

$$ w_{ij}^{(t)} := |C_j^{(t)}| \cdot K_{ij} / w_i^{(t)} \quad \text{with} \qquad (5.6) $$

$$ w_i^{(t)} := \sum_{j=1}^{m} |C_j^{(t)}| \cdot K_{ij} \qquad (5.7) $$

depend on the current class sizes $|C_j^{(t)}|$.

(2) Then, for the given center system $Z = Z^{(t)}$, (5.2) is minimized w.r.t.
$C$, yielding the minimum-distance partition $C^{(t+1)}$ of $O$ with classes

$$ C_i^{(t+1)} := \{ k \in \{1, \ldots, n\} \mid i = \arg\min_j \phi_j(x_k; Z^{(t)}) \} \qquad (5.8) $$

($i = 1, \ldots, m$) and the class-specific dissimilarity measure $\phi_j$,
(5.3).



(3) After obtaining stationarity at step $t = T$, say, each center $z_j^{(T)}$
(or class $C_j^{(T)}$) is assigned to the lattice point $P_j$ of $\mathcal{L}$
which has minimum distance
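
Steps (1) and (2) of (3.a) can be sketched as follows, using (5.5) to (5.8).
This is a minimal illustration in plain Python; the function names, the path
distance (number of horizontal and vertical lattice steps), the weight
function $K(\delta) = 0.5^\delta$ and the toy data are assumptions made only
for this example.

```python
def path_distance(p, q):
    """Lattice path distance, assuming edges between horizontal and
    vertical neighbours."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def squared_distance(x, z):
    return sum((xi - zi) ** 2 for xi, zi in zip(x, z))


def batch_kohonen(data, vertices, initial_labels, weight, max_iter=100):
    """Alternate the center update (5.5)-(5.7) and the assignment (5.8).

    data:           list of points (tuples).
    vertices:       lattice vertices P_1, ..., P_m as (row, col) pairs.
    initial_labels: initial partition, one class index per point.
    weight:         function K of the lattice path distance.
    """
    m, dim = len(vertices), len(data[0])
    K = [[weight(path_distance(p, q)) for q in vertices] for p in vertices]
    labels = list(initial_labels)
    centers = [list(data[0]) for _ in range(m)]
    for _ in range(max_iter):
        # Class sizes |C_j| and class sums |C_j| * centroid_j.
        sizes = [0] * m
        sums = [[0.0] * dim for _ in range(m)]
        for x, lab in zip(data, labels):
            sizes[lab] += 1
            for d in range(dim):
                sums[lab][d] += x[d]
        # (5.5)-(5.7): weighted mean of the class centroids.
        for i in range(m):
            w_i = sum(K[i][j] * sizes[j] for j in range(m))
            if w_i > 0:
                centers[i] = [
                    sum(K[i][j] * sums[j][d] for j in range(m)) / w_i
                    for d in range(dim)
                ]

        def phi(i, x):
            # Dissimilarity (5.3) between x and vertex P_i.
            return sum(
                K[i][j] * squared_distance(x, centers[j]) for j in range(m)
            )

        # (5.8): minimum-distance partition with respect to phi.
        new_labels = [min(range(m), key=lambda i: phi(i, x)) for x in data]
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers


# Toy example (illustrative only): two vertices, K(0) = 1, K(1) = 0.5.
data = [(0.0,), (1.0,), (10.0,), (11.0,)]
labels, centers = batch_kohonen(
    data, vertices=[(0, 0), (0, 1)], initial_labels=[0, 0, 1, 1],
    weight=lambda delta: 0.5 ** delta,
)
print(labels, centers)
```

Because neighbouring classes contribute to each center, the centers are pulled
towards each other compared with the plain class centroids 0.5 and 10.5.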






(3.b) Stochastic approximation for the K-criterion:

The K-criterion (5.4) is a special case of the criterion $G(B,Z)$, (4.9), such
that Tsypkin & Kelmans' updating formula (4.13) applies. Calculating
the gradients in (4.13) yields: The generalized Kohonen algorithm (5.1) is
the stochastic approximation device for the new K-criterion (5.4) (see also
Anouar et al. 1997).

Remark: A corresponding analogue of MacQueen's sequential clustering
algorithm for minimizing (5.4) is described in Bock (1998).

Invoking Theorem 2 and an adaptation of Theorem 1(a) we can characterize
the asymptotic behaviour of the previous algorithms (3.a), (3.b) for
$n \to \infty$:



Theorem 3:

Assume that the minimum of the K-criterion $g$, (5.4), is obtained for a
unique optimum configuration $(Z^*, B^*)$ and that the data points
$x_1, x_2, \ldots$ are independently sampled from a sufficiently regular density
$f(x)$ (and some singularities are excluded). Then:

(a) The optimum configurations $(Z^{*(n)}, C^{*(n)})$ for Anouar's SSQ criterion
$g_n$, (5.2), and the minimum criterion values converge:

$$ z_i^{*(n)} = \sum_{j=1}^{m} w_{ij}^{(n)} \, \bar{x}_{C_j^{*(n)}} \;\to\; \zeta_i^* := \sum_{j=1}^{m} w_{ij}^* \cdot E[X \mid X \in B_j^*] \qquad (5.9) $$

$$ g_n^* := \min_{C,Z} g_n(C,Z) \;\to\; g^* := \min_{B,Z} g(B,Z) \qquad (5.10) $$

in probability and a.s. with normalized weights
$w_{ij}^* := P(X \in B_j^*) \cdot K_{ij} / w_i^*$ and
$w_i^* := \sum_{j=1}^{m} P(X \in B_j^*) \cdot K_{ij}$ ($i = 1, \ldots, m$).

(b) Similar convergence properties hold for the center systems $Z^{(n)}$
obtained from the generalized Kohonen updating formula (5.1) which yields
the stochastic approximation method for the K-criterion (5.4):

$$ z_i^{(n)} \;\to\; \zeta_i^*, \qquad i = 1, \ldots, m \qquad (5.11) $$

$$ g^{(n)} := g_n(C^{(n)}, Z^{(n)}) \;\to\; g^* := g(B^*, Z^*) \qquad (5.12) $$

where $(B^*, Z^*)$ is a stationary pair for (5.4) and $C^{(n)}$ the
minimum-distance partition of $\{x_1, \ldots, x_n\}$ generated by the center
system $Z^{(n)}$ (using the distances $\phi_i$).

Theorem 3 shows that asymptotically the same results can be obtained by using the sequential approximation method (5.1), (3.b) or the nonsequential k-means algorithm (3.a). The extent to which 'topologically correct' solutions are obtained in both cases can be investigated by simulations where, e.g., the underlying density $f(x)$ is concentrated near a low-dimensional, bent manifold of $R^p$. Questions of 'topological correctness' are also discussed in Cottrell & Fort (1989), Tolat (1990), Bouton & Pages (1993) and Fort & Pages (1996).
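For simulations of this kind, one iteration of the nonsequential k-means algorithm (center update (5.5) to (5.7), then minimum-distance assignment (5.8)) can be sketched in plain Python. The function names are hypothetical, and the dissimilarity (5.3) is assumed here to be $\phi_i(x; Z) = \sum_j K_{ij} \|x - z_j\|^2$, which is not restated in this section.

```python
def sqdist(x, z):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, z))


def update_centers(data, classes, K, old_centers):
    """Center update: z_i = sum_j w_ij * centroid(C_j), w_ij = |C_j| K_ij / w_i."""
    m, p = len(classes), len(data[0])
    # |C_j| * centroid(C_j) is the coordinate-wise sum over class j.
    sums = [[sum(data[k][d] for k in c) for d in range(p)] for c in classes]
    sizes = [len(c) for c in classes]
    centers = []
    for i in range(m):
        w_i = sum(sizes[j] * K[i][j] for j in range(m))           # (5.7)
        if w_i == 0:                       # no weighted mass: keep the old center
            centers.append(list(old_centers[i]))
            continue
        centers.append([sum(K[i][j] * sums[j][d] for j in range(m)) / w_i
                        for d in range(p)])                       # (5.5), (5.6)
    return centers


def assign(data, centers, K):
    """Minimum-distance partition (5.8) under the assumed phi_i."""
    m = len(centers)
    classes = [[] for _ in range(m)]
    for k, x in enumerate(data):
        d = [sqdist(x, z) for z in centers]
        phi = [sum(K[i][j] * d[j] for j in range(m)) for i in range(m)]
        classes[phi.index(min(phi))].append(k)
    return classes
```

Alternating `update_centers` and `assign` until the partition no longer changes gives the batch procedure. With $K$ equal to the identity matrix both functions reduce to the classical k-means steps.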



6 Statistically Motivated SOMs 

The criterion-based approach to Kohonen networks suggests various generalizations which parallel the extension of the variance criterion (4.2) to criteria suggested by geometrical considerations (shape of the clusters, class-specific hyperplanes) or statistical models (maximum-likelihood clustering). The new resulting clustering criteria allow for the visualization aspect by incorporating the transformed path distances $K_{ij} = K(\delta(P_i, P_j))$, with the result that mini-classes $C_i, C_j$ with similar parametrizations $\vartheta_i, \vartheta_j$, say, are represented by neighbouring vertices in $\mathcal{L}$. The following special cases 1., 2., 3. illustrate the general maximum-likelihood approach for SOMs presented in 4. below.

1. Ellipsoidal mini-clusters: If the data are supposed to lie near a low-dimensional manifold, we may ask for ellipsoidal mini-clusters, which may fit this manifold locally better than the spherical mini-clusters underlying the clustering criteria (4.2) and (4.3). This leads to criteria of the type:

$$g_n(\mathcal{C}, Z, \Sigma) := \frac{1}{n} \sum_{i=1}^{m} \sum_{k \in C_i} \Big[ \sum_{j=1}^{m} K_{ij} \, \|x_k - z_j\|^2_{\Sigma_j^{-1}} \Big] \;\to\; \min_{\mathcal{C}, Z, \Sigma} \tag{6.1}$$

where minimization is also with respect to the positive definite $p \times p$ matrices $\Sigma_1, \ldots, \Sigma_m$ restricted to $\det(\Sigma_i) = 1$. The corresponding k-means approach incorporates scatter matrices of the type $W_i := \sum_{j=1}^{m} \sum_{k \in C_j} K_{ij} (x_k - \bar{x}_{C_j})(x_k - \bar{x}_{C_j})'$ (for $\bar{x}_{C_j}$ see (5.5)) and $\Sigma_i := W_i / |W_i|^{1/p}$ in analogy to Bock (1996b, Section 3.2, 1996c). A suitable stochastic approximation approach proceeds by updating formulas for the parameters $z_i$ and $\Sigma_i$.



2. Principal-component-based SOM: We characterize each mini-cluster $C_i$ by a class-specific hyperplane $H_i$ (with a prespecified dimension $s < p$) instead of a single center point $z_i$. $H_i$ is parametrized by $H_i = \{\, y = a_i + \sum_{r=1}^{s} \beta_r v_{ir} \mid \beta_1, \ldots, \beta_s \in R \,\}$ with a vector $a_i \in R^p$ and a $p \times s$ matrix $V_i = (v_{i1}, \ldots, v_{is})$ of orthonormal column vectors $v_{i1}, \ldots, v_{is} \in R^p$. The corresponding finite-sample K-criterion is given by:

$$g_n(\mathcal{C}, \mathcal{H}) := \frac{1}{n} \sum_{i=1}^{m} \sum_{k \in C_i} \Big[ \sum_{j=1}^{m} K_{ij} \cdot d(x_k, H_j) \Big] \;\to\; \min_{\mathcal{C}, \mathcal{H}} \tag{6.2}$$

which is minimized with respect to $\mathcal{C}$ and to the hyperplane system $\mathcal{H} = (H_1, \ldots, H_m)$. Here $d(x, H_i) := \min_{y \in H_i} \|x - y\|^2 = \|(I_p - V_i V_i')(x - a_i)\|^2$ is the distance of a point $x \in R^p$ from the hyperplane $H_i$. In analogy to classical principal component clustering (Bock 1974, Chap. 17; 1996a,b,c), the corresponding k-means algorithm has to determine, for a fixed $\mathcal{C} = \mathcal{C}^{(t)}$, the generalized principal component hyperplanes which belong to the previously mentioned scatter matrices $W_i$. A similar non-linear approach ('algorithm ACC') is proposed by Herault & Guerin-Dugue (1997).
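The distance $d(x, H_i)$ used in (6.2) is straightforward to evaluate once $a_i$ and the orthonormal directions are known. The following sketch (with the hypothetical name `hyperplane_sqdist`, and with the columns of $V_i$ passed as a list of vectors) uses the identity $\|(I_p - V V')r\|^2 = \|r\|^2 - \sum_r (v_r' r)^2$, which holds only when the $v_r$ are orthonormal, as assumed in the text.

```python
def hyperplane_sqdist(x, a, V):
    """Squared distance ||(I - V V')(x - a)||^2 from x to H = {a + sum_r beta_r v_r}.

    V is a list of the s orthonormal direction vectors v_1, ..., v_s
    (the columns of the p x s matrix in the text)."""
    r = [xi - ai for xi, ai in zip(x, a)]
    total = sum(ri * ri for ri in r)
    # For orthonormal directions the projected part has squared length sum_r (v_r . r)^2.
    proj = sum(sum(vi * ri for vi, ri in zip(v, r)) ** 2 for v in V)
    return total - proj
```

For instance, the point (1, 2, 3) has squared distance 9 from the plane spanned by the first two coordinate axes through the origin.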

3. Regression clustering for SOMs: Here the data are given by pairs $(x_k, y_k)$, $k = 1, 2, \ldots$, where $x_k \in R^p$ is an explanatory vector and $y_k \in R^s$ a response vector. In the mini-class $C_i$ we assume a local regression model $y(x) = a_i + B_i x$ with a vector $a_i \in R^s$ and an $s \times p$ matrix $B_i$ of regression coefficients. A corresponding clustering or mapping criterion would be:

$$g_n(\mathcal{C}, a, B) := \frac{1}{n} \sum_{i=1}^{m} \sum_{k \in C_i} \Big[ \sum_{j=1}^{m} K_{ij} \, \|y_k - a_j - B_j x_k\|^2 \Big] \;\to\; \min_{\mathcal{C}, a, B} \tag{6.3}$$



Whereas the k-means approach proceeds by iteratively computing (generalized) regression hyperplanes and assigning each data point to the closest regression hyperplane, the stochastic approximation approach has to update successively the vectors $a_i$ and the regression coefficients in $B_i$. Other regression-type approaches have been proposed, e.g., by Ritter (1997), Moshou & Ramon (1997), Kohonen et al. (1997) and Badran et al. (1997).

4. SOMs based on statistical models and a modified m.l. criterion: 

The clustering criteria (6.1), (6.2) and (6.3) resemble the classical clustering criteria which are derived from a probabilistic model: Assuming the existence of a fixed unknown 'true' m-partition $\mathcal{C}$ of $n$ objects, we suppose that for all objects $k$ from the same class $C_i$ the (random) data vectors $X_k$ are distributed with the same density $f(x; \vartheta_i)$ with a class-specific parameter vector $\vartheta_i$. This fixed-classification model (Bock 1996a,b,c) leads to a maximum-likelihood (m.l.) clustering criterion whose generalized version in the SOM context is provided by:

$$g_n(\mathcal{C}, \vartheta) := \sum_{i=1}^{m} \sum_{k \in C_i} \Big[ \sum_{j=1}^{m} K_{ij} \cdot \{ -\log f(x_k; \vartheta_j) \} \Big] \;\to\; \min_{\mathcal{C}, \vartheta} \tag{6.4}$$

where [...] defines a likelihood-based dissimilarity measure $\phi_i(x; \vartheta)$ in analogy to (5.3) (with $\vartheta = (\vartheta_1, \ldots, \vartheta_m)$). This criterion and its continuous counterpart will yield a thematic map where classes $C_i, C_j$ with similar model parameters $\vartheta_i, \vartheta_j$ ('class representatives') are displayed by neighbouring vertices $P_i, P_j$ of the lattice $\mathcal{L}$.

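The resulting maximum-likelihood k-means algorithm for SOMs is easily obtained from the classical m.l. k-means strategy: After having found the t-th m-partition $\mathcal{C}^{(t)}$, the optimum parameter $\vartheta^{(t)}$ for the classes $C_i = C_i^{(t)}$ is obtained by solving the generalized m.l. equations:

$$\sum_{j=1}^{m} \sum_{k \in C_j} K_{ij} \, \frac{\partial}{\partial \vartheta_i} \log f(x_k; \vartheta_i) = 0, \qquad i = 1, \ldots, m. \tag{6.5}$$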



Here the i-th equation amounts to the usual m.l. estimation of $\vartheta_i$ from data $x_1, \ldots, x_n \sim f(\cdot; \vartheta_i)$, where each observation $x_k$ with $k \in C_j$ has a multiplicity (weight) $K_{ij}$. The subsequent partition $\mathcal{C}^{(t+1)}$ is defined to be the minimum-distance partition generated by $\vartheta^{(t)}$ when using the dissimilarity measure [...] from (6.4).
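To make the weighted estimation concrete, consider, purely as an illustrative assumption, a Poisson density $f(x; \vartheta)$ with mean $\vartheta$. Since $\partial \log f(x; \vartheta) / \partial \vartheta = x/\vartheta - 1$, the i-th weighted m.l. equation has the closed-form solution $\vartheta_i = \sum_j K_{ij} \sum_{k \in C_j} x_k \,/\, \sum_j K_{ij} |C_j|$, i.e. a neighbourhood-weighted mean of the counts. The function name below is hypothetical.

```python
def poisson_ml_update(counts, classes, K):
    """Weighted m.l. estimates for a Poisson model (an assumed example density).

    counts: list of observed counts x_k; classes: list of m lists of indices k;
    K: m x m neighbourhood weights. Returns the list of theta_i."""
    m = len(classes)
    theta = []
    for i in range(m):
        # Each observation in class C_j enters with multiplicity K[i][j].
        num = sum(K[i][j] * counts[k] for j in range(m) for k in classes[j])
        den = sum(K[i][j] * len(classes[j]) for j in range(m))
        theta.append(num / den)
    return theta
```

With $K$ equal to the identity matrix each $\vartheta_i$ is simply the mean count of its own class, which is the classical m.l. k-means update for this model.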

The corresponding stochastic approximation algorithm proceeds, for $n = n_0 + 1$, by the updating formula:



m 

+ Kij-^logf{xn+x-,d), ( 6 . 6 ) 

j=i 1.?=.?'"' 

for $i = 1, \ldots, m$. The number $n_0$ must be so large that the initial data $x_1, \ldots, x_{n_0}$ provide reasonable estimates for all class parameters in the first stage. Obviously, the previous approaches 1., 2., 3. are special cases of this general m.l. approach, whose importance may be seen in the fact that it can cope with qualitative variables and discrete distributions (such as loglinear models) as well. Ambroise & Govaert (1996) have proposed a SOM-type approach based on a Gaussian mixture model and using a modified EM algorithm.

References 

ADORF, H.-M. and MURTAGH, F. (1988): Clustering based on neural network 
processing. In: D. Edwards, N.E. Raun (eds.): COMPSTAT 1988. Physica 
Verlag, Heidelberg, 1988, 239-244. 

ANOUAR, F., BADRAN, F. and THIRIA, S. (1996): Topological maps for mix- 
ture density. Intern. Conf. on Artificial Neural Networks, ICANN’96, Bochum, 
Germany. 

ANOUAR, F., BADRAN, F. and THIRIA, S. (1997): Cartes topologiques et 
nuees dynamiques. In: S. Thiria et al. (eds.) (1997), 190-206. 

AMBROISE, Ch. and GOVAERT, G. (1996): Constrained clustering and Kohonen self-organizing maps. J. of Classification 13, 299-313.

BADRAN, F., DAIGREMONT, Ph. and THIRIA, S. (1997): Regression par 
carte topologique. In: Thiria, S., et al. (eds.) (1997), 207-222. 

BEZDEK, J.C. (1981): Pattern recognition with fuzzy objective function algo- 
rithms. Plenum Press, New York. 

BOCK, H.H. (1974): Automatische Klassifikation. Vandenhoeck & Ruprecht, Göttingen, 1974, 480 pp.

BOCK, H.H. (1979): Fuzzy clustering procedures. In: R. Tomassone (ed.): Analyse de donnees et informatique. INRIA, Le Chesnay, France, 1979, 205-218.

BOCK, H.H. (1985): On some significance tests in cluster analysis. J. of Classi- 
fication 2, 77-108. 

BOCK, H.H. (1996a): Probability models and hypotheses testing in partitioning 
cluster analysis. In: Ph. Arabie, L. Hubert, G. De Soete (eds.): Clustering and 
classification. World Scientific, River Edge, NJ, 1996, 377-453. 

BOCK, H.H. (1996b): Probabilistic models in partitional cluster analysis. In: A. 
Ferligoj, A. Kramberger (eds.): Developments in data analysis. FDV, Metodoloski 
zvezki, 12, Ljubljana, 1996, 3-25. 

BOCK, H.H. (1996c): Probabilistic models in cluster analysis. Computational 
Statistics and Data Analysis 23, 5-28. 

BOCK, H.H. (1997): Simultaneous visualization and clustering methods as an 
alternative to Kohonen maps. In: G. Della Riccia et al. (eds.), 1997, 67-85. 

BOCK, H.H. (1998): Clustering and neural networks. In: A. Rizzi, M. Vichi, 
H.H. Bock (eds.): Advances in data science and classification. Springer Verlag, 
Heidelberg, 1998, 265-278. 







BOUTON, C., and G. PAGES (1993): Selforganization and convergence of the 
one-dimensional Kohonen algorithm with non-uniformly distributed stimuli. Stoch- 
astic Processes and Applications 47 (1993) 249-274. 

BRAVERMAN, E.M. (1966): The method of potential functions in the problem 
of training machines to recognize patterns without a teacher. Automation and 
Remote Control 27, 1748-1771. 

BUTLER, G.A. (1969): A vector field approach to cluster analysis. Pattern 
Recognition 1, 291-299. 

COLEMAN, J.S. (1971): Clustering in n dimensions by use of a system of forces. 
Journal of Mathematical Sociology 1, 1-47. 

COTTRELL, M. and FORT, J.-C. (1989): Etude d'un processus d'auto-organisation. Ann. Inst. Henri Poincare Probab. Statist. 23, 1-20.

DELLA RICCIA, G., KRUSE, R. and LENZ, H.-J. (eds.) (1997): Learning, 
networks and statistics. CISM Courses and Lectures no. 382. Springer, Wien 
1997. 

FORT, J.-C. and PAGES, G. (1996): About the Kohonen algorithm: strong or 
weak self-organisation? Neural Networks 9, 773-785. 

HAYASHI, Ch., OHSUMI, N., YAJIMA, K., TANAKA, Y., BOCK, H.-H. and 
BABA, Y. (Eds.) (1998): Data science, classification and related methods. Proc. 
IFCS-96. Springer-Verlag, Tokyo, 1998.

HERAULT, J. and GUERIN-DUGUE, A. (1997): Analyse de donnees multidi- 
mensionnelles par reseaux de neurones auto-organises. In: S. Thiria et al. (1997), 
153-170. 

HOPFIELD, J.J. (1982): Neural networks and physical systems with emergent 
collective computational capabilities. Proc. Natl. Acad. Sci. USA 79, 2554-2558. 



HOPFIELD, J.J. and TANK, D.W. (1985): Neural computation of decisions in 
optimization problems. Biological Cybernetics 52, 141-152. 

KAMGAR-PARSI, B., J.A. GUALTIERI, J.E. DEVANEY and KAMGAR-PARSI, 
B. (1990): Clustering with neural networks. Biological Cybernetics 63, 201-208. 

KOHONEN, T. (1982): Self-organized formation of topologically correct feature 
maps. Biological Cybernetics 43, 59-69. 

KOHONEN, T. (1997): Self-organizing maps. Springer, New York. 

KOHONEN, T., S. KASKI, H. LAPPALAINEN and SALOJARVI, J. (1997): The 
adaptive-subspace self-organizing map. Workshop on Neural Networks, Helsinki, 
Website. 

MACQUEEN, J. (1967): Some methods for classification and analysis of multivariate observations. In: L.M. LeCam et al. (eds.): Proc. 5th Berkeley Symp. on Math. Stat. Probab. Univ. of California Press, Los Angeles, vol. 1, 281-297.







MOSHOU, D. and RAMON, H. (1997): Extended self-organizing maps with local 
linear mappings for function approximation and system identification. Workshop 
on Neural Networks, Helsinki, Website. 

MURTAGH, F. (1995): The Kohonen self-organizing map method: an assessment. J. of Classification 12, 165-190.

MURTAGH, F. (1996): Neural networks for clustering. In: Ph. Arabie, L. Hu- 
bert, G. De Soete (eds.): Clustering and classification. World Scientific, River 
Edge, NJ, 1996, 235-269. 

POLLARD, D. (1981): Strong consistency of k-means clustering. Annals of Probab. 10, 919-926.

RITTER, H. (1997): Neural networks for rapid learning in computer vision and 
robotics. In: G. Della Riccia et al. (eds.), 1997, 25-39. 

SATO, M. and SATO, Y. (1995): Neural clustering: Implementation of clustering 
model using neural networks. Proc. IEEE Conference, 1995, 3609-3614. 

SATO, M. and SATO, Y. (1998): Additive clustering model and its generalizations. In: Ch. Hayashi et al., Proc. IFCS-96, 1998, 312-319.

SHEPARD, R.N. and ARABIE, Ph. (1979): Additive clustering: Representation of similarities as combinations of discrete overlapping properties. Psychological Review 86, 87-123.

SPATH, H. (1985): Cluster dissection and analysis. Theory, FORTRAN pro- 
grams, examples. Horwood, Chichester, 1985. 

THIRIA, S., LECHEVALLIER, Y., GASCUEL, O. and CANU, S. (1997): Statis- 
tique et methodes neuronales. Dunod, Paris, 311 pp. 

TOLAT, V.V. (1990): An analysis of Kohonen’s self-organizing maps using a 
system of energy functions. Biological Cybernetics 64, 155-164. 

TSYPKIN, Y.Z. and KELMANS, G.K. (1967): Recursive self-training algorithms. 
Engineering Cybernetics USSR, 1967, V, 70-79. 

WRIGHT, W.E. (1977): Gravitational clustering. Pattern Recognition 9, 151- 
166. 




Data Model and Classification by Trees 

Olivier Gascuel

Dpt. d'Informatique Fondamentale, LIRMM,

161 rue Ada, 34392 Montpellier, France

Abstract: Let $D$ be an ultrametric (or tree) distance, $T$ its tree representation, and $\Delta$ a dissimilarity matrix that is an estimate of $D$. Our aim is to reconstruct $T$ from $\Delta$. This problem is encountered, for example, in Biology and Archaeology, where $T$ represents the history of some living species or relics from the past, and where $\Delta$ estimates the pairwise divergence times between these species or relics. Moreover, we assume that the variance-covariance matrix of the elements in $\Delta$ is available. This matrix may be a consequence of the experimental process used to collect the data, or induced by the data model at hand. We propose a way of benefiting from this additional knowledge, by modifying the usual agglomerative (or ascending) algorithm. At each step of the algorithm, this involves reducing the dissimilarity matrix $\Delta$ so that the variance of its elements is minimized. In this way, we obtain better estimates for selecting the pair of objects to be agglomerated and estimating the edge-lengths. The method we propose applies to both ultrametric and tree distances, and it has a low computational complexity. This method has been used to deal with data derived from biological sequences, which implies a rather complex, non-diagonal variance-covariance matrix. Very good results have been obtained, especially concerning the ability to recover the structure of the true tree $T$.



1 Introduction 

Let $D = (d_{ij})$ be an ultrametric (or tree) distance over $n$ objects; $T$ the unique ultrametric (resp. positively valued) tree representing $D$, also referred to as the true tree; $\Delta = (\delta_{ij})$ a dissimilarity matrix with each element $\delta_{ij}$ being an estimate of the distance $d_{ij}$. In this article, we are interested in methods aimed at finding $T$ from $\Delta$. In other words, we try to construct an ultrametric (resp. positively valued) tree $\hat{T}$, associated with the ultrametric (resp. tree) distance $\hat{D} = (\hat{d}_{ij})$, which should be as close as possible to $T$. This problem is encountered in domains where attempts are made to reconstruct an inheritance phenomenon as, for example, the history of manuscripts in Archaeology or the evolution of species in Biology. The tree $T$ represents the history of these objects (manuscripts or species). In the ultrametric case, the distances $d_{ij}$ represent the divergence times between objects, and the dissimilarities $\delta_{ij}$ are estimates of these divergence times. In the general case, we only know that the distances $d_{ij}$ are tree-like. For example, in the Evolution domain, these distances represent the number of mutations that separate two species. However, some mutations may be hidden due to back, parallel or multiple events and, moreover, the mutation rate cannot be considered constant over time, so we only have estimates of the mutational distances and we must be satisfied with finding unrooted additive trees (Swofford et al. (1996)).

The main approaches to deal with this problem may be sorted into two families. The "agglomerative" family (Barthelemy and Guenoche (1991); Gordon (1996)) progressively builds up a tree by picking a pair of objects, creating a new node that represents the cluster of these objects, and computing a new reduced dissimilarity matrix where both objects are replaced by this node. In the ultrametric case, the agglomeration criterion simply consists in choosing the objects whose dissimilarity is minimal, while the reduction formula varies depending on the view (Sneath and Sokal (1973)) or model (Degens (1983, 1988); Bock (1996)) we have of the data. For example, it has been shown (Degens (1983)) that the (unweighted) average-link algorithm is well suited if we assume the model $(\delta_{ij}) = (d_{ij}) + (\varepsilon_{ij})$, where the $\varepsilon_{ij}$s are i.i.d. noise variables with null expectation and normal distribution. In the tree distance case, two different agglomeration criteria are used in the ADDTREE (Sattath and Tversky (1977)) and NJ (Saitou and Nei (1987); Studier and Keppler (1988)) algorithms, but it can be shown (Gascuel (1994)) that the latter is a continuous version of the former. To our knowledge, very little work has been done concerning the reduction formula, the only two solutions being the weighted (used by ADDTREE and NJ) and unweighted (Bandelt and Dress (1986); Vach and Degens (1991); Gascuel (1997b)) approaches, the latter being clearly the most appropriate in the above i.i.d. normal model.

The second family (Barthelemy and Guenoche (1991); Mirkin (1996)) uses mathematical programming techniques and directly seeks an ultrametric (or tree) distance $\hat{D}$ that is close to $\Delta$. We usually rely on the least-squares criterion $\sum_{i<j} (\hat{d}_{ij} - \delta_{ij})^2$ (Cunningham (1978); Hubert and Arabie (1995); Gascuel and Levy (1996)), but some algorithms (De Soete (1983); Felsenstein (1997)) have the ability to minimize the weighted least-squares $\sum_{i<j} v_{ij}^{-1} (\hat{d}_{ij} - \delta_{ij})^2$, where $v_{ij}$ is the variance of the $\delta_{ij}$ estimate. Moreover, Bulmer (1991) has suggested using the generalized least-squares $\sum_{i<j,\, k<l} v^{-1}_{ij,kl} (\hat{d}_{ij} - \delta_{ij})(\hat{d}_{kl} - \delta_{kl})$, where $v^{-1}_{ij,kl}$ is now an element in the inverse of the variance-covariance matrix of the $\delta_{ij}$ estimates, which will be denoted as $V$ in the following. This solution is appealing since it provides minimum variance estimators in the usual case where the values to be estimated are continuous. However, no efficient algorithm exists to infer trees in this sense, partly due to the computational cost of inverting $V$, which is $O(n^6)$. Moreover, minimizing any of these three criteria is NP-hard (Day (1987)), so practical methods are heuristics and do not guarantee that the optimal solution will be found.

In practice, we often have access to information concerning the matrix $V$. It may be explicitly known, for example when the data are obtained by an experimental process with repeated measures, missing values, and/or non-independent experiments. In some models, this matrix can be estimated from the data. Finally, in some cases, its shape and properties are induced by the model. In this paper, we propose a simple agglomerative approach that takes advantage of this additional knowledge. It involves, at each step, reducing the dissimilarity matrix $\Delta$ so that the variance of its elements is minimized, and we call it the minimum variance reduction (MVR) method. In this way, we obtain better estimates for selecting the pair of objects to be agglomerated and computing the edge-lengths, at least when $\Delta$ is unbiased. This assumption, which is implicit in most (e.g., least-squares) methods, will be made from now on. Our approach is possible because the agglomerative scheme possesses some degrees of freedom, with both ultrametric and tree distances. In the following, we first present this scheme and its degrees of freedom (Section 2). Then we define the minimum variance reduction (Section 3) and describe the algorithms and their computational cost (Section 4). Finally, we briefly present some applications and discuss the approach (Section 5).

2 The Agglomerative Scheme and its Degrees 
of Freedom 

We first deal with ultrametrics, then later with tree distances. However, the line of approach is the same in both cases: we present general algorithms whose degrees of freedom are represented by numerical parameters; moreover, regardless of the values of these parameters, the algorithm is guaranteed to find the correct tree representation when the dissimilarity is an ultrametric (resp. tree) distance. This correctness property is essential to achieve our objective, which is to recover the true tree $T$, especially when $\Delta$ is equal (or close) to $D$.

2.1 Ultrametric Distance 

The algorithm is summarized in Figure 1. 



Until the number of clusters is 1, repeat:

Select the pair $\{x, y\}$ to be agglomerated by minimizing $\delta_{xy}$;

Agglomerate $x$ and $y$ to form a new cluster $u$ whose level of formation is $\delta_{xy}$;

For every $i \neq x, y$ reduce the dissimilarity matrix using:

$\delta_{ui} = \lambda_i \delta_{xi} + (1 - \lambda_i)\, \delta_{yi}$;

Output the tree.

Figure 1: The agglomerative scheme for ultrametric distances.

The degrees of freedom are represented by the $\lambda$ parameters, and it is easy to prove that the algorithm recovers the correct tree whenever $\Delta$ is ultrametric, regardless of the values of these parameters. Notably, the $\lambda$s can be different for each agglomerated pair $\{x, y\}$ and each $i$, so that a more explicit notation would be $\lambda_{ixy}$ instead of $\lambda_i$. Our reduction formula is less general than Lance and Williams' (1967) and Jambu's (1977), but our purpose is different. For the sake of statistical consistency, we impose the correctness property, while these authors wished to provide a very general formula, able to represent most of the existing algorithms. Moreover, our formula generalizes numerous classical correct algorithms, such as the single-link, complete-link, average-link and weighted average-link algorithms (Gordon (1996)). For example, the (unweighted) average-link corresponds to $\lambda_i = n_x / (n_x + n_y)$, where $n_y$ is the number of initial objects contained in the cluster $y$. However, some other correct approaches cannot be represented using this scheme (nor using Lance and Williams' or Jambu's), as for example the median procedure studied in Degens (1983).
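The scheme of Figure 1 can be sketched in a few lines of Python. This is an illustration only: the function names, the representation of the dissimilarities as a dictionary keyed by `frozenset` pairs, and the labelling of a new cluster by the tuple of its two children are choices made for this sketch, not part of the method itself. The $\lambda$ parameters are supplied as a callback, so that different classical algorithms are obtained by swapping that callback.

```python
def agglomerate(delta, lam):
    """Generic agglomerative scheme for ultrametric distances.

    delta: dict mapping frozenset({a, b}) -> dissimilarity between clusters.
    lam:   function (x, y, i, sizes) -> lambda_i used in the reduction step.
    Returns the list of merges (x, y, level) in order of formation."""
    delta = dict(delta)
    clusters = set()
    for pair in delta:
        clusters |= pair
    sizes = {c: 1 for c in clusters}
    merges = []
    while len(clusters) > 1:
        pair = min(delta, key=delta.get)       # pair with minimal dissimilarity
        x, y = sorted(pair, key=str)           # fixed order for reproducibility
        level = delta[pair]
        u = (x, y)                             # label of the new cluster
        sizes[u] = sizes[x] + sizes[y]
        for i in clusters - {x, y}:
            l = lam(x, y, i, sizes)
            # Reduction: delta_ui = lambda_i * delta_xi + (1 - lambda_i) * delta_yi
            delta[frozenset((u, i))] = (l * delta[frozenset((x, i))]
                                        + (1 - l) * delta[frozenset((y, i))])
        # Drop every entry that still involves x or y.
        delta = {p: v for p, v in delta.items() if x not in p and y not in p}
        clusters = (clusters - {x, y}) | {u}
        merges.append((x, y, level))
    return merges


def average_link(x, y, i, sizes):
    """lambda_i = n_x / (n_x + n_y): the unweighted average-link choice."""
    return sizes[x] / (sizes[x] + sizes[y])
```

On an ultrametric input the merge levels do not depend on the callback, which is the correctness property discussed above; on noisy input the choice of $\lambda_i$ changes the reduced dissimilarities and hence possibly the tree.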



2.2 Tree Distance 

The algorithm is summarized in Figure 2. The degrees of freedom are represented by the $w_{x,i}$, $w_{y,i}$ and $\lambda_i$ parameters (as above, a more explicit notation would be, respectively, $w_{x,ixy}$, $w_{y,ixy}$ and $\lambda_{ixy}$). This algorithm recovers the correct tree whenever $\Delta$ is a tree distance, regardless of the values of these parameters. The key to the proof is the correctness of the agglomeration criterion, shown in Atteson (1997) and Gascuel (1997b).



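Initialize the number of clusters: $r \leftarrow n$;

Until the number of clusters is 2, repeat:

Compute the sums $R_z = \sum_{i} \delta_{zi}$ for any $z \in \{1, \ldots, r\}$;

Select the pair $\{x, y\}$ to be agglomerated by maximizing $R_x + R_y - (r - 2)\, \delta_{xy}$;

Agglomerate $x$ and $y$ to form a new cluster with root $u$;

Estimate the length of the edges $(x, u)$ and $(y, u)$ using:

$d_{xu} = \sum_{i \neq x,y} w_{x,i} (\delta_{xy} + \delta_{xi} - \delta_{yi})$ and $d_{yu} = \sum_{i \neq x,y} w_{y,i} (\delta_{xy} + \delta_{yi} - \delta_{xi})$,

where $\sum_{i} w_{x,i} = \sum_{i} w_{y,i} = 1/2$;

Decrease the number of clusters: $r \leftarrow r - 1$;

Reduce the dissimilarity matrix using for every $i \neq x, y$:

$\delta_{ui} = \lambda_i \delta_{xi} + (1 - \lambda_i)\, \delta_{yi} - \lambda_i d_{xu} - (1 - \lambda_i)\, d_{yu}$;

Output the tree.

Figure 2: The agglomerative scheme for tree distances.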



This algorithm is a generalization of the NJ algorithm, which is obtained with $w_{x,i} = w_{y,i} = 1 / (2(r - 2))$ and $\lambda_i = 1/2$. To our knowledge, this generalization has never been mentioned before. Moreover, we have studied in depth another version (Gascuel (1997b)) of this general algorithm, first suggested by Vach and Degens (1991), which is defined by $w_{x,i} = w_{y,i} = n_i / (2(n - n_x - n_y))$ and $\lambda_i = n_x / (n_x + n_y)$. This new version is called UNJ, where U means unweighted. Indeed, UNJ can be seen as the tree distance version of the unweighted average-link (or UPGMA) algorithm.
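One pass of the general scheme with the NJ parameters $w_{x,i} = w_{y,i} = 1/(2(r-2))$ and $\lambda_i = 1/2$ can be sketched as follows. The function name `nj_step` and the dict-of-dicts representation of the dissimilarities are assumptions made for this illustration; other parameter choices such as UNJ would only change the two marked lines.

```python
def nj_step(delta, nodes):
    """One agglomeration step of the tree-distance scheme with NJ parameters.

    delta: dict of dicts, delta[a][b] = dissimilarity (symmetric).
    nodes: list of current cluster labels (r = len(nodes) >= 3).
    Returns (x, y, d_xu, d_yu, reduced), where reduced maps i -> delta_ui."""
    r = len(nodes)
    R = {z: sum(delta[z][i] for i in nodes if i != z) for z in nodes}
    # Agglomeration criterion: maximise R_x + R_y - (r - 2) * delta_xy.
    pairs = [(a, b) for k, a in enumerate(nodes) for b in nodes[k + 1:]]
    x, y = max(pairs,
               key=lambda p: R[p[0]] + R[p[1]] - (r - 2) * delta[p[0]][p[1]])
    others = [i for i in nodes if i not in (x, y)]
    w = 1.0 / (2 * (r - 2))      # NJ choice of w_{x,i} = w_{y,i}; weights sum to 1/2
    lam = 0.5                    # NJ choice of lambda_i
    d_xu = sum(w * (delta[x][y] + delta[x][i] - delta[y][i]) for i in others)
    d_yu = sum(w * (delta[x][y] + delta[y][i] - delta[x][i]) for i in others)
    reduced = {i: lam * (delta[x][i] - d_xu) + (1 - lam) * (delta[y][i] - d_yu)
               for i in others}
    return x, y, d_xu, d_yu, reduced
```

Applied to an exact tree distance, the step returns the true lengths of the two external edges and the true distances from the new node to the remaining objects.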

3 Minimum Variance Reduction

As above, we first deal with ultrametrics, then with tree distances. In both cases, the goal is the same: to reduce the dissimilarity $\Delta$ so that the variance of the new estimates $\delta_{ui}$ is minimized. We will see that the problem is much simpler with ultrametric than with tree distances, for which we propose a simplified, approximate solution.

3.1 Ultrametric Distance 

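Let $V$ be the variance-covariance matrix of the $\delta_{ij}$ estimates, and let $v_{ij,kl}$ be the covariance of $\delta_{ij}$ and $\delta_{kl}$. For the sake of simplicity, the variance of $\delta_{ij}$ will be denoted as $v_{ij}$. We have to minimize:

$$\mathrm{Var}(\delta_{ui}) = \mathrm{Var}\big(\lambda_i \delta_{xi} + (1 - \lambda_i)\, \delta_{yi}\big) = \lambda_i^2 v_{xi} + (1 - \lambda_i)^2 v_{yi} + 2 \lambda_i (1 - \lambda_i)\, v_{xi,yi}.$$

This is a second-degree polynomial in $\lambda_i$ whose minimum is attained at:

$$\lambda_i^* = \frac{v_{yi} - v_{xi,yi}}{v_{xi} + v_{yi} - 2 v_{xi,yi}}. \tag{1}$$

For example, when we assume the i.i.d. normal model, we obtain:

$$\lambda_i^* = \frac{1/(n_y n_i)}{1/(n_x n_i) + 1/(n_y n_i)} = \frac{n_x}{n_x + n_y}.$$

In other words, we get (as expected) the average-link algorithm.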
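Formula (1) is easy to check numerically. In the sketch below the function name `lambda_star` and the handling of the degenerate case are assumptions of this illustration. The example uses the i.i.d. model, in which an average over $n_x n_i$ independent estimates has variance proportional to $1/(n_x n_i)$ and the covariance term vanishes.

```python
def lambda_star(v_xi, v_yi, cov=0.0):
    """Minimiser of l^2 * v_xi + (1 - l)^2 * v_yi + 2 * l * (1 - l) * cov."""
    denom = v_xi + v_yi - 2.0 * cov      # equals Var(delta_xi - delta_yi)
    if denom <= 0:
        # Degenerate case: the variance does not depend on lambda, so any value works.
        return 0.5
    return (v_yi - cov) / denom


# i.i.d. model with n_x = 3, n_y = 1, n_i = 2 and unit noise variance.
n_x, n_y, n_i = 3, 1, 2
lam = lambda_star(1.0 / (n_x * n_i), 1.0 / (n_y * n_i))
```

Here `lam` agrees with the average-link value $n_x / (n_x + n_y) = 0.75$ up to rounding.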

3.2 Tree Distance 

The new estimates $\delta_{ui}$ now depend on both the $\lambda$ and $w$ parameters and they are not linear with respect to these parameters. The approach we propose is not optimal, but it induces a reasonable computational cost. It involves first minimizing the variance of the edge-length estimates $d_{xu}$ and $d_{yu}$, and then minimizing the variance of the $\delta_{ui}$ estimates. We describe both procedures successively.

Minimizing the variance of $d_{xu}$ (and $d_{yu}$) with respect to the $w$ parameters does not present any particular difficulty. When developing the variance of $d_{xu}$ we obtain a second-degree polynomial with respect to the $w$ parameters, whose coefficients are obtained from the variances and covariances in $V$. Moreover, we have to fulfill the constraint $\sum_i w_{x,i} = 1/2$. Considering the partial derivatives, we obtain a linear system with $r - 1$ equations and unknowns: $(r - 2)$ $w$ parameters plus one Lagrange multiplier corresponding to the constraint. Solving this linear system requires $O(r^3)$ in time. An alternative solution, less expensive in computational time but less accurate, consists of minimizing the variance of the term $\sum_i w_i (\delta_{xi} - \delta_{yi})$ that appears in both $d_{xu}$ and $d_{yu}$, with opposite sign. A single $O(r^3)$ minimization provides the set of optimal values $w_i^*$, and we set $w_{x,i}^* = w_{y,i}^* = w_i^*$. This simplified solution is optimal when for every $i$ we have $v_{xy,xi} = v_{xy,yi}$. When the covariances are null (or assumed to be), the problem is much simpler since we have the analytic solution $w_i^* = K / (v_{xi} + v_{yi})$, where $K$ is the normalization term imposed by the constraint. In this case, estimating the edge-lengths requires $O(r)$.
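The analytic solution for null covariances amounts to inverse-variance weighting under the constraint that the weights sum to 1/2. A minimal sketch, with the hypothetical name `edge_weights` and the variances passed as two parallel lists over the objects $i \neq x, y$:

```python
def edge_weights(v_x, v_y):
    """w_i* = K / (v_xi + v_yi), with K chosen so that the weights sum to 1/2.

    v_x[i] and v_y[i] are the variances of delta_xi and delta_yi;
    covariances are assumed to be null."""
    inv = [1.0 / (a + b) for a, b in zip(v_x, v_y)]
    K = 0.5 / sum(inv)
    return [K * t for t in inv]
```

When all variances are equal, every weight is $1/(2(r-2))$, which is the NJ choice.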

Once $d_{xu}$ and $d_{yu}$ have been determined, we have to minimize for every $i \neq x, y$:

$$\mathrm{Var}(\delta_{ui}) = \mathrm{Var}\big(\lambda_i \delta_{xi} + (1 - \lambda_i)\, \delta_{yi} - \lambda_i d_{xu} - (1 - \lambda_i)\, d_{yu}\big)$$

$$= \lambda_i^2 \big(v_{xi} + \mathrm{Var}(d_{xu}) - 2\, \mathrm{Cov}(\delta_{xi}, d_{xu})\big)$$

$$+ (1 - \lambda_i)^2 \big(v_{yi} + \mathrm{Var}(d_{yu}) - 2\, \mathrm{Cov}(\delta_{yi}, d_{yu})\big)$$

$$+ 2 \lambda_i (1 - \lambda_i) \big(v_{xi,yi} + \mathrm{Cov}(d_{xu}, d_{yu}) - \mathrm{Cov}(\delta_{xi}, d_{yu}) - \mathrm{Cov}(\delta_{yi}, d_{xu})\big).$$

The exact method consists in first computing the variances and the covariance of $d_{xu}$ and $d_{yu}$, and for every $i$ the covariances of $\delta_{xi}$ and $d_{xu}$, $\delta_{yi}$ and $d_{xu}$, and so on. This requires $O(r^2)$ in time. Then, for each $i$ we obtain a second-degree polynomial whose minimum is obtained by the derivative in $O(1)$. In other words, minimizing the variance of all $\delta_{ui}$ estimates requires $O(r^2)$. When the covariances are null, this minimization requires $O(r)$. A simplified method is to neglect the last two terms in the reduction formula and only to minimize the variance of $\lambda_i \delta_{xi} + (1 - \lambda_i)\, \delta_{yi}$. We then obtain solution (1). This simplified solution is optimal when the covariances are null. Moreover, when the values of the $\lambda_i^*$s are supposed to be close or identical, the last two terms in the reduction formula become independent of $i$, and it can be shown that only the variance of the first two terms has an influence on the further steps of the algorithm. In this case (which occurs, e.g., with biological sequence data (Gascuel (1997a))) the solution (1) is fully justified. Moreover, instead of using multiple $\lambda_i^*$s, we use the average value.



4 Algorithms 

We have shown how the minimum variance reduction can be achieved whenever we have the variance-covariance matrix $V$. As the estimates in the dissimilarity matrix $\Delta$ change along the course of the algorithm, we have to update the matrix $V$ at each step. Once again, the task is easier for ultrametric than for tree distances. In both cases, we will first present the updating method, then the whole algorithm.

4.1 Ultrametric Distance

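Once the minimum variance reduction is achieved, we have new estimates $\delta_{ui}$ with shape $\delta_{ui} = \lambda_i^* \delta_{xi} + (1 - \lambda_i^*)\, \delta_{yi}$, and we must determine their variance and their covariances with the other estimates in $\Delta$. We have the following formulae:

For any $i, j \neq x, y, u$:

$$\mathrm{Cov}(\delta_{ui}, \delta_{uj}) = \lambda_i^* \lambda_j^* v_{xi,xj} + \lambda_i^* (1 - \lambda_j^*)\, v_{xi,yj} + (1 - \lambda_i^*)\, \lambda_j^* v_{yi,xj} + (1 - \lambda_i^*)(1 - \lambda_j^*)\, v_{yi,yj}$$

For any $i, k, l \neq x, y, u$:

$$\mathrm{Cov}(\delta_{ui}, \delta_{kl}) = \lambda_i^* v_{xi,kl} + (1 - \lambda_i^*)\, v_{yi,kl} \tag{2}$$

Each of these formulae is computed in $O(1)$ and we have $O(r^3)$ formulae, so updating the matrix $V$ requires $O(r^3)$. The whole algorithm is summarized in Figure 3.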
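Both updating formulae are plain bilinear expansions of the covariance, as the following sketch shows. The function names and the flat argument lists are assumptions of this illustration; in a full implementation the covariances would be read from whatever structure stores $V$.

```python
def cov_ui_uj(lam_i, lam_j, v_xi_xj, v_xi_yj, v_yi_xj, v_yi_yj):
    """Covariance of delta_ui and delta_uj after the reduction (first formula)."""
    return (lam_i * lam_j * v_xi_xj
            + lam_i * (1 - lam_j) * v_xi_yj
            + (1 - lam_i) * lam_j * v_yi_xj
            + (1 - lam_i) * (1 - lam_j) * v_yi_yj)


def cov_ui_kl(lam_i, v_xi_kl, v_yi_kl):
    """Covariance of delta_ui with an untouched estimate delta_kl (second formula)."""
    return lam_i * v_xi_kl + (1 - lam_i) * v_yi_kl
```

Taking $i = j$ in the first function returns the variance of $\delta_{ui}$, i.e. the polynomial that was minimized in Section 3.1.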

Until the number of clusters is 1, repeat:

Select the pair $\{x, y\}$ to be agglomerated by minimizing $\delta_{xy}$;

Agglomerate $x$ and $y$ to form a new cluster $u$ whose level of formation is $\delta_{xy}$;

For every $i \neq x, y$, compute the optimal values $\lambda_i^*$ using formula (1) and reduce the dissimilarity matrix by:

$\delta_{ui} = \lambda_i^* \delta_{xi} + (1 - \lambda_i^*)\, \delta_{yi}$;

Update the variance-covariance matrix $V$ using formulae (2).

Output the tree.

Figure 3: The MVR algorithm for ultrametric distances.

The most time-consuming operation is the updating of the variance-covariance matrix $V$, which induces a global time complexity in $O(n^4)$. In fact, this complexity is not high since the size of $V$ is $O(n^2 \times n^2) = O(n^4)$. In other words, this complexity could be seen as optimal since it is linear in the size of the data.

4.2 Tree Distance 

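Once the minimum variance reduction is achieved, we have the edge-length estimates $d_{xu}$ and $d_{yu}$, and new estimates $\delta_{ui} = \lambda_i^* \delta_{xi} + (1 - \lambda_i^*)\, \delta_{yi} - \lambda_i^* d_{xu} - (1 - \lambda_i^*)\, d_{yu}$. The new variances and covariances are obtained from the formulae:

For any $i, j \neq x, y, u$:

$$\mathrm{Cov}(\delta_{ui}, \delta_{uj}) = \lambda_i^* \lambda_j^* \big(v_{xi,xj} + \mathrm{Var}(d_{xu}) - \mathrm{Cov}(\delta_{xi}, d_{xu}) - \mathrm{Cov}(\delta_{xj}, d_{xu})\big)$$

$$+ (1 - \lambda_i^*)(1 - \lambda_j^*) \big(v_{yi,yj} + \mathrm{Var}(d_{yu}) - \mathrm{Cov}(\delta_{yi}, d_{yu}) - \mathrm{Cov}(\delta_{yj}, d_{yu})\big)$$

$$+ \lambda_i^* (1 - \lambda_j^*) \big(v_{xi,yj} - \mathrm{Cov}(\delta_{xi}, d_{yu}) - \mathrm{Cov}(\delta_{yj}, d_{xu}) + \mathrm{Cov}(d_{xu}, d_{yu})\big)$$

$$+ (1 - \lambda_i^*)\, \lambda_j^* \big(v_{yi,xj} - \mathrm{Cov}(\delta_{yi}, d_{xu}) - \mathrm{Cov}(\delta_{xj}, d_{yu}) + \mathrm{Cov}(d_{xu}, d_{yu})\big),$$

For any $i, k, l \neq x, y, u$:

$$\mathrm{Cov}(\delta_{ui}, \delta_{kl}) = \lambda_i^* v_{xi,kl} + (1 - \lambda_i^*)\, v_{yi,kl} - \lambda_i^*\, \mathrm{Cov}(\delta_{kl}, d_{xu}) - (1 - \lambda_i^*)\, \mathrm{Cov}(\delta_{kl}, d_{yu}). \tag{3}$$

To compute in $O(1)$ each of these $O(r^3)$ formulae, it is sufficient to have the variances and the covariance of $d_{xu}$ and $d_{yu}$ (already computed, see §3.2), and the covariances $\mathrm{Cov}(\delta_{kl}, d_{xu})$ and $\mathrm{Cov}(\delta_{kl}, d_{yu})$ for any $k$ and $l$. Computation of these latter covariances requires $O(r)$ for each one, and there are $O(r^2)$ such covariances. Therefore, updating the matrix $V$ can be achieved in $O(r^3)$. The MVR algorithm for tree distances is summarized in Figure 4.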

Initialize the number of clusters; r n; 

Until the number of clusters is 2, repeat: 

(a) Compute the sums Rz = Szi for any z: G {1, . . . , r}; 

(b) Select the pair {x,y} to be agglomerated by maximizing Ry - {r - 2)Sxy] 
Agglomerate x and y to form a new cluster with root u; 

(c) Compute the optimal values and (§3.2) and set: 

dxu = {Sxy + Sxi — Syi) and dyu = (Sxy + Syi — 6xi)\ 

(d) Compute the variances and the covariance of $d_{xu}$ and $d_{yu}$,
and the covariances of both with every dissimilarity $\delta_{kl}$;

(e) For every $i \neq x, y$, compute the optimal values $\lambda_i^*$ (§3.2)
and reduce the dissimilarity matrix by:

$\delta_{ui} = \lambda_i^* \delta_{xi} + (1 - \lambda_i^*) \delta_{yi} - \lambda_i^* d_{xu} - (1 - \lambda_i^*) d_{yu}$;

Decrease the number of clusters: $r \leftarrow r - 1$;

(f) Update the variance-covariance matrix V using formulae (3).

Output the tree. 

Figure 4: The MVR algorithm for tree distances. 
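
Steps (a), (b) and (e) of Figure 4 can be sketched as follows. This is a minimal illustration, not the author's implementation: the function names are hypothetical, the selection criterion is taken to be the usual neighbor-joining form $R_x + R_y - (r - 2)\delta_{xy}$, and the weight `lam` and the edge lengths `d_xu`, `d_yu` are passed in as given values, because the formulae of §3.2 that produce them are not reproduced here.

```python
def select_pair(delta):
    """Steps (a)-(b): return the pair (x, y) maximizing R_x + R_y - (r - 2) * delta[x][y]."""
    r = len(delta)
    sums = [sum(row) for row in delta]  # R_z = sum over i of delta[z][i]
    best_pair, best_score = None, float("-inf")
    for x in range(r):
        for y in range(x + 1, r):
            score = sums[x] + sums[y] - (r - 2) * delta[x][y]
            if score > best_score:  # ties keep the first pair found
                best_pair, best_score = (x, y), score
    return best_pair


def reduce_entry(delta, x, y, i, lam, d_xu, d_yu):
    """Step (e): new dissimilarity between the root u of {x, y} and object i.

    lam, d_xu and d_yu are assumed to be supplied by the caller.
    """
    return lam * (delta[x][i] - d_xu) + (1.0 - lam) * (delta[y][i] - d_yu)


# Tree distance on four leaves (0, 1 | 2, 3): pendant edges 1, 2, 3, 4, internal edge 5.
dist = [
    [0.0, 3.0, 9.0, 10.0],
    [3.0, 0.0, 10.0, 11.0],
    [9.0, 10.0, 0.0, 7.0],
    [10.0, 11.0, 7.0, 0.0],
]
x, y = select_pair(dist)
new_to_2 = reduce_entry(dist, x, y, 2, lam=0.5, d_xu=1.0, d_yu=2.0)
```

On an exact tree distance such as `dist`, the reduced value does not depend on `lam`: both bracketed terms equal the true distance from u to i. The choice of weight only matters when the dissimilarities are noisy estimates, which is the situation the variance minimization addresses.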

The most time-consuming operations correspond to lines (c, d, f), each of which
requires $O(r^3)$, while lines (a, b, e) require $O(r^2)$. Therefore, the whole
algorithm has a time complexity in $O(n^4)$.
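
The global bound can be obtained by summing the per-iteration cost over all agglomeration steps. Assuming each iteration with r remaining clusters costs on the order of $r^3$ operations, with c a hypothetical constant:

```latex
\[
  \sum_{r=3}^{n} c\, r^{3} \;\le\; c \sum_{r=1}^{n} r^{3}
  \;=\; c \left( \frac{n(n+1)}{2} \right)^{2} \;=\; O(n^{4}).
\]
```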



5 Discussion 

The MVR approach has been applied to data derived from biological sequences.
In this case, the data model induces a very specific matrix V where the
variances are equal (within a constant factor) to the distances, and where
the covariance between two estimates $\delta_{ij}$ and $\delta_{kl}$ is equal
(within the same factor) to the length of the path common to $d_{ij}$ and
$d_{kl}$. It follows that the covariances can be computed from the variances,
and that the variances (and covariances) can be estimated from the data.
Moreover, compared with the general algorithm given above, we have a
simplification that yields an $O(n^3)$ time complexity. Intensive simulations
have shown that when the true tree T is a tree distance far from the
ultrametric condition, the accuracy of the method in reconstructing T, and
especially its structure, is very significantly improved compared with the
standard NJ algorithm. When T is ultrametric, the gain is modest, but
systematically positive. This work is reported in Gascuel (1997a).
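
To illustrate how such covariances could be read off the distances, the sketch below computes the length of the path common to (i, j) and (k, l) directly from a tree distance. It relies on the standard four-point property of tree distances rather than on any formula from the text, the function name is hypothetical, and the constant factor of the data model is left out.

```python
def common_path_length(d, i, j, k, l):
    """Length shared by the tree paths i-j and k-l, given tree distances d[a][b].

    Assumes d is an exact tree distance (it satisfies the four-point condition).
    """
    total = d[i][j] + d[k][l]
    other = min(d[i][k] + d[j][l], d[i][l] + d[j][k])
    return max(0.0, (total - other) / 2.0)


# Four leaves (0, 1 | 2, 3): pendant edges 1, 2, 3, 4 and an internal edge of length 5.
dist = [
    [0.0, 3.0, 9.0, 10.0],
    [3.0, 0.0, 10.0, 11.0],
    [9.0, 10.0, 0.0, 7.0],
    [10.0, 11.0, 7.0, 0.0],
]
shared = common_path_length(dist, 0, 2, 1, 3)  # both paths cross the internal edge
```

Here `shared` equals the internal edge length, while the paths 0-1 and 2-3 have nothing in common and give zero. Under the data model described above, the covariance would be this length multiplied by the same constant factor as the variances.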

The MVR approach can be used to study and define algorithms adapted to
specific data models. For example, we have also used it to design the UNJ
algorithm, mentioned above, which applies to the i.i.d. normal model
(Gascuel (1997b)). The generality of this approach and the good results
already obtained seem to indicate that this is a promising direction of
research that deserves further investigation.

Abstract: We are currently experiencing a major shift in the way we understand
the design and development of educational environments in our institutions. New
technologies are finding their way into traditional educational settings,
although a multitude of problems has to be dealt with. In this paper the authors
first comment on the general aspects of multimedia technology and multimedia
product delivery, along with the problems that arise when integrating such
products into educational environments. The authors then present an
instructional framework based on the prescriptions of Cognitive Flexibility
Theory and Cognitive Apprenticeship, in an attempt to address these problems
and propose a solution.



1 Introduction 

We are currently experiencing a major shift in the way we understand the design
and development of educational environments in our institutions. Multimedia
technology combined with computer networks has already introduced plenty of
powerful ways to organize and deliver courses electronically to students. At
the same time, instructional designers understand that it is not merely the
technological advances that should define the characteristics of the
educational environment, but mostly the specific educational perspective and
the cognitive opportunities that educators wish to offer to students. New
technologies are finding their way into traditional educational settings,
although a multitude of problems has to be dealt with, such as lack of
infrastructure, insufficient training of software designers in pedagogy, lack
of teacher and trainer training, lack of a critical mass of good and relevant
products, and lack of information and evaluation data for users on the products
that are available. In the following we first comment on the general aspects of
multimedia technology and multimedia product delivery, along with the problems
that arise when integrating such products into educational environments. We
then present the framework we have developed in our attempt to address these
problems and to propose a solution.







2 Multimedia Technology 

Multimedia technology is known to integrate the so-called "Four C's":
Computers, Consumer electronic products, Communication capabilities and
Content. Multimedia products for various purposes (educational, training,
recreational, informational) can be delivered to the end user using either
package media (off-line) or telecomm media (on-line), or even a hybrid scheme
that efficiently integrates these two basic distribution technologies. Under
the term "package media" we include all media that need a physical support and
where the information is prerecorded, such as videodisc or videocassette in
the analog media world and CD-ROM or CD-I in the digital domain. Telecomm
media, on the other hand, include all media that are transmitted to a receiver.
Networks such as the Internet, with information delivery via the Web, are a
well-known example of telecomm media. The mass market for multimedia
educational products, whether on optical disks (CD-ROM and CD-I) or in the
form of services via telematic networks, is destined to develop swiftly.

3 Problems Concerning the Design of Tech- 
nology Based Educational Environments 

Although computer multimedia hardware and software are increasingly available
at ever-dropping prices, integrating this technology into educational
institutions is not simply a matter of infrastructure development. After the
first years of naive enthusiasm about the capabilities of the medium, great
concern emerged among researchers and educators about the pedagogical issues
related to the design of educational environments based on new technologies. A
major debate (that is still going on) deals with the relationship between the
medium and the method. Advocates of the so-called "strong" media theory propose
that "in a good instructional design, media and method are narrowly integrated,
and, consequently, the learner constructs knowledge in interaction with medium
and method" (Kozma (1991)). Opponents of this view support a "weak" media
theory, claiming that "the primary advantages of using new electronic media
such as computers, television and video disks for teaching may be economic and
not psychological, i.e. under some conditions they make learning faster and/or
cheaper but no one medium contributes unique learning benefits that cannot be
obtained from another medium" (Clark (1983, 1992)). Another intriguing problem
in technology-based education is how one should proceed in order to introduce
new technologies into the traditional classroom. It is well understood that if
the educational system wants to satisfy society's demands for better-trained
individuals who react faster and are more open to change, then instructional
methods should undergo a re-engineering, leading of course not to the
substitution of the teacher by computer-based learning but to a better
integration of interactive multimedia resources as additional tools in
pedagogical systems centered on human interactions. The problem is that
teachers in most cases lack sufficient experience, training and understanding
to modify their instructional approach, and thus use the medium too much as a
mere add-on to an existing, unchanged classroom setting. A problem also arises
because software designers do not always take into account the research
results of cognitive science concerning the way people learn. Many multimedia
educational products are basically atheoretical, based only on the features
and capabilities that the technology provides. Although the domain of
education is ill-structured in many of its features, allowing various
approaches according to various needs, research results already indicate
important guidelines to keep in mind when designing educational settings for
the technologically equipped environment.



4 A Framework for the Design and Develop- 
ment of Hypermedia Educational Environ- 
ments 

In dealing with the problems of technology-based educational environments, we
designed and developed a general-purpose framework which proposes the
combination of certain instructional methodologies in order to organize the
delivery of instruction to students fruitfully. In many contemporary knowledge
domains the traditional lecture-based approach fails to introduce students to
the complexity and the divergence of knowledge application. This is especially
true at an intermediate or advanced level of instruction, when the acquisition
of the more structured characteristics of the domain (mainly domain concepts)
has already been achieved and there is a need for a more holistic approach
that addresses complexity. In addition, it is important nowadays to integrate
views of everyday knowledge application into our teaching, offering our
students the chance to prepare themselves better for the world of work. With
these two important considerations in mind, our framework proposes that
instruction should begin within the context of a case-based environment where
properly chosen cases are structured in a specific way in order to present
real-world problems and scenarios to the students. The way that we choose to
structure cases is prescribed by Cognitive Flexibility Theory (Spiro & Jehng
(1990), Jacobson et al. (1995)) and can be understood as an effort to guide
students to cross the cognitive "landscape" of the domain, following various
paths and thus constructing a deeper, multi-faceted understanding of the
presented knowledge. One learns by criss-crossing conceptual landscapes, much
as a traveler better understands the complex form of a landscape after having
followed several paths and seen it from various standpoints. The educational
environment instructs the student to see and study these cognitive paths,
offering what we call "guided criss-crossing" of the domain. Right after this
structured activity our framework proposes the introduction of the student
into a Cognitive Apprenticeship environment. Such an environment offers the
students: a) Modeling of the knowledge, which may be modeling of the physical
processes underlying the phenomena we want students to understand and modeling
of the thought processes underlying expert performance, b) Scaffolding, that
is, support given to students as they carry out a task, c) Coaching: a range
of activities such as choosing tasks, modeling how to do them, providing hints
and scaffolding, diagnosing problems and giving feedback, challenging and
offering encouragement, and structuring how to do things, and d) Educational
simulation of procedures and principles of the cognitive world we want
students to understand, giving them the opportunity to engage actively in
experimentation and trial of their own ideas and to get constructive feedback
on their problem-solving efforts.

5 ISTOS: Application of the Educational 
Framework to the Computer Networking 
Domain 

We implemented the guidelines of our framework by developing ISTOS
(Demetriadis et al. (1997)), a hypermedia environment for supporting learning
in the computer networking domain. ISTOS is an environment that not only
introduces students to the abstract and highly structured aspects of the
computer networking domain but simultaneously guides them to understand how
this knowledge is used in practice by domain experts in everyday applications.
The cognitive apprenticeship environment in ISTOS allows students to engage
actively in knowledge inquiry activities and experience the results of their
decisions. Instruction in ISTOS is organized in a series of levels. Each level
is based upon a certain educational model of the domain knowledge, i.e. a
complete set of facts, concepts, procedures and principles that describe the
domain, and presents to the student/users three distinct loci of activities:
Analysis, Synthesis and Apprenticeship. Analysis and Synthesis guide students
through criss-crossing activities of the domain, while Apprenticeship offers
an environment for cognitive apprenticeship.

5.1 Analysis 

ISTOS begins instruction by presenting to the students, and analyzing for
them, representative cases of real-life problem situations concerning the
computer networking domain (i.e. problems of computer network installations or
upgrades), thus anchoring the abstract knowledge structures to concrete
problems. Each case is divided into three case-sections in order to organize
the content better. The first case-section deals with the network needs and
specifications in the case described, the second presents the selected
hardware components, and the third comments on the network software installed
and used. While studying cases in Analysis, students are offered support in
three major ways: a) the Thematic Commentary, b) the Glossary, and c) Internet
Connection to specially designed Web pages.

a) In each case-section student/users can study the appropriate "thematic
commentary" that accompanies the case study. The thematic commentary is
prescribed by Cognitive Flexibility Theory (Spiro & Jehng (1990), Jacobson et
al. (1995)) as the kind of comment that connects the more abstract concepts of
the domain to the more concrete and specific characteristics that these
concepts display in the specific case. It is the kind of comment that helps
students understand the general through the specific and vice versa. For
example, in the "Hardware" case-section, where reference is made to the
specific topology and communication media used in the network presented, there
are also the themes of "Topology" and "Communication Media", which explain the
general concept and comment on the specific form that this concept takes in
the specific case.

b) The Glossary is organized as a list of the most frequently used concepts in
the domain. There are text-anchored hyperlinks in the case-sections that lead
to the Glossary items when clicked, and students may also call the Glossary
facility whenever they need it and select from the list the term they wish to
see information about. The Glossary also contains animation and video files
that may help students understand the topic.

c) According to our design, the content delivered in ISTOS is divided into two
distinct subcategories: "local" and "remote". "Local" content includes
everything that is contained on the ISTOS CD-ROM and is delivered locally to
the user. Two properties characterize this content, a technical and a
pedagogical one. Technically, this content contains any big file that would be
time-consuming to deliver through the network (e.g. video files, high-quality
images, complex interactions). Pedagogically, it is content that refers to
relatively stable information (abstract concepts, well-established procedures
and implementations) which is expected to remain valid for the coming two or
three years. On the other hand, "remote" content includes the kind of
information that is subject to faster change (such as market trends and
hardware and software prices). This kind of information (mainly text and
highly compressed images) is organized in Web pages on our server so that
updating can be easily accomplished. Connecting to the Web server and
displaying the pages is incorporated into the ISTOS environment (no external
Web browser is needed), so students experience a seamless integration of local
and remote information delivery that presents the more theoretical and
scientific aspects of the domain along with market-related and faster-changing
applied knowledge.

5.2 Synthesis 

In Synthesis student/users are guided to synthesize a more complex and
in-depth understanding of the presented material. Students engage in guided
criss-crossing and are supposed to overcome the case-specific nature of
concrete presentations, developing a more abstract, complex and multi-pathed
understanding of the material presented. In this way knowledge that is not
explicitly contained in the system becomes available as a product of the
criss-crossing activity. Each question in the synthesis activity triggers the
crossing of the content following a new path. Students are guided to follow
hyperlinks to the thematic comments (that were previously available when
studying the case-sections). This guided thematic criss-crossing offers the
student the chance to review the content in an entirely different way.
Students now have a certain question in mind and are seeking information to
formulate an answer. Moreover, the themes that they are guided to study offer
the chance to understand how a more abstract concept (say "Topology") takes
concrete forms in the various cases ("Bus Topology", "Ring Topology", "Star
Topology" etc.) and so to become experienced with the different "landscapes"
that the domain might exhibit.

5.3 Apprenticeship 

Analysis and Synthesis are activities prescribed by Cognitive Flexibility
Theory in order to help students understand the complexities of the domain by
criss-crossing the offered cognitive "landscapes". The question remains
whether a more freely structured review of the content would add to students'
understanding of it. ISTOS follows the general advice of gradually introducing
students to more freely structured educational environments (Collins (1996))
and offers an "apprenticeship" module, which is the final activity in each
level of study. In Apprenticeship student/users play the role of a computer
networking expert responsible for a network services company, challenged by
domain problem situations and trying to formulate proper answers or propose
acceptable networking solutions. While Analysis employs a rather expository
way of presenting knowledge and Synthesis offers a guided criss-crossing
experience (i.e. hyperlinks already organized for students to follow), the aim
of Apprenticeship is to support both memory retention and deeper understanding
by offering the chance for a more freely structured and task-oriented review
of the content. This is done by motivating and engaging student/users in
activities that allow them to test in practice, correct and complete the
domain knowledge model that they have acquired by working in Analysis and
Synthesis. In the Apprenticeship module the user interface is structured as a
typical office environment where students may use a number of available
facilities in order to review content and offer answers and solutions to the
problems posed. When students open the office door, "virtual" clients start
entering the office, posing questions related to various aspects of the
content ("what is..." questions asking about the more abstract concepts, "mini
cases" questions demanding the proposal of a network solution, "visual
recognition" questions demanding the proper selection between images of
network hardware components). There are links available for the students that
connect to various facilities of the environment, such as: the Glossary, the
Case Reviewer (which allows students to freely select the cases or thematic
commentary they wish to review), the Network Simulator (a special part of the
overall program that allows the student/user to create a virtual network,
making all the necessary hardware and software arrangements, and propose it as
a possible implementation in a mini case problem), the "modem" icon
(connection to the Web server for reviewing remotely delivered information),
the "printer" icon (printing of texts and images), the "Activity Results"
(results of the current activity in Apprenticeship) and finally the
"Assistant" image, which constitutes an on-line and context-sensitive help
facility that students may use at any time in order to get hints for best
performance. Questions that have been answered using the assistant's help are
repeated later on until the student can give an accepted answer without using
any hints from the assistant. Apprenticeship thus constitutes an environment
that motivates students to review the available resources by themselves while
trying to answer questions, and thus to achieve a better level of knowledge
integration.
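
The repetition rule for assisted answers can be sketched as a simple queue. This is an illustration only and not the ISTOS implementation: the names `run_session` and `answer_fn` are hypothetical, and the sketch additionally assumes that a rejected answer is also asked again and that a session stops after a fixed number of attempts.

```python
from collections import deque


def run_session(questions, answer_fn, max_attempts=100):
    """Ask questions in order; re-queue any question that needed a hint.

    answer_fn(question) must return a pair (accepted, used_hint).
    Returns the sequence of questions in the order they were asked.
    """
    pending = deque(questions)
    asked = []
    while pending and len(asked) < max_attempts:
        question = pending.popleft()
        accepted, used_hint = answer_fn(question)
        asked.append(question)
        if not accepted or used_hint:
            pending.append(question)  # repeat later, after the other questions
    return asked


# Simulated student: needs the assistant the first time "q2" is asked.
seen = set()

def student(question):
    first_time = question not in seen
    seen.add(question)
    return True, (question == "q2" and first_time)

order = run_session(["q1", "q2", "q3"], student)
```

In this run the question answered with help comes back once more at the end of the queue, after the remaining questions have been asked.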



6 Evaluation of Educational Environments 

Evaluating an educational environment is a task that comprises the
implementation of various methods and approaches in order to produce useful
information about the educational benefits or shortcomings of the specific
environment. Evaluation methods may be seen as belonging to two distinct
categories depending on the purpose of their implementation: "formative"
methods, applied in the design or early development phase, are expected to
produce information to support the better formation of the educational
environment, while "summative" methods try to offer insight after the
educational environment has been used by end users. Evaluation methods can
also be categorized as quantitative or qualitative, depending on the
methodology implemented for the evaluation. In the case of hypermedia
environments there is yet another category: the "user interface" evaluation
methods, which try to elicit information on the overall usability of the user
interface. A good user interface is considered to be one that is cognitively
transparent, enabling users to complete their tasks with the minimum possible
cognitive load and thus allowing them to concentrate more on what they want to
do than on how to do it. A well-known method of user interface evaluation
based on expert opinions is the so-called Walkthrough (and its more recent
version, the "Jogthrough" (Rowley & Rhoades (1992))), where experts are asked
to judge the software and try to locate possible shortcomings that a would-be
user might encounter. These methods try to evaluate a user interface by
considering what would happen the very first time that an end user sat in
front of the interface and worked with it. Wishing to elicit information on
how easily a user might get accustomed to the specific design of an
educationally oriented interface (that is, an interface that is going to be
used repeatedly), we modified the Jogthrough method by introducing the time
variable into our expert questionnaire, in terms of the user's repetitions of
the tasks of the system. We propose that this might be a useful way to allow
the expert panel to express their evaluation of how easily a specific
interface would permit users to develop, over time, the ability to complete
tasks satisfactorily. We call this method "Graphical Jogthrough" since the
experts' opinions are graphically represented and statistically elaborated in
order to extract quantitative results about the user-friendliness and
intuitiveness of the interface. We applied this method to the Network
Simulator (the part of ISTOS that permits the simulated construction of a
computer network), and our main conclusion was that the method may offer
substantial help to user interface design, especially if the expert team
possesses an interdisciplinary background. Designers should fully understand
the kind of expertise that guides evaluators in formulating their design
proposals. If this expertise lacks cognitive and pedagogical dimensions
(being restricted, for example, to multimedia development and domain knowledge
expertise), then it might lead to design proposals that the design team should
reject in order to follow the original pedagogy that their user interface
should support.

7 Research Focus 

We are currently developing ISTOS as a multitude of educational hypermedia
titles that cover every important aspect of Communication Technologies
(conventional LANs, high-speed LANs, FR, SMDS, ISDN, ATM). Our research
focuses mainly on novice students, and we will try to locate the educational
benefits and possible shortcomings that they experience while working in such
a case-based learning environment. Hopefully they will acquire abstract
conceptual domain knowledge accompanied by the ability to identify practical
problems successfully and apply proper solutions.

Abstract: Assigning objects to similarity classes is fundamental to the
process of scientific discovery and even to daily life. This universal process,
which started some millennia ago with the giving of generic names to objects,
has been the subject of automatic data processing procedures for about half a
century. Nowadays, a broad choice of techniques for the classification of
objects, described by a set of multidimensional variables, is fully
operational. These techniques are all based on the principle that similar
objects should be gathered in the same cluster, whereas dissimilar objects
belong to different clusters. The paper situates the question of the ‘natural’
classes in a broader perspective before proposing a conceptual frame, within
which a more pragmatic approach is developed, in line with the different
classification algorithms and the concept of anisotropic parameter space.
Against this background, a number of well-known methods are analysed and
compared.



1 Introduction 

Unsupervised classification or clustering of objects must be considered as
one of the basic intellectual activities of the human being.

When wandering the earth, early homo sapiens came across many objects. His
reaction was twofold: either he gave the objects a personal name, John, or
Mike, or Jack, or (and maybe in addition) he gave a generic name to a number
of objects that were similar, and another generic name to a group of objects
that were dissimilar from the first group, but mutually similar. Examples of
such generic names are: tree, or flower, or mountain. By the latter approach,
it became possible to consider similar characteristics of the objects with the
same generic name: trees are high and strong and can burn; this is not true
for flowers. This approach, which is generally not possible with objects
having the same personal name, is the basis for any language.

Indeed, this approach makes or breaks the whole cultural development of
mankind; in literature, an author describes objects of a kind, which the
reader, without having seen exactly the same objects, can recognize as objects
similar to those he knows: the reader can understand what the writer is
writing about.

Similarly, in science the study of some objects results in laws that are
applicable to the other objects of the same class.

So, depending on the kind of phenomenon one is interested in, the group will
be different. A whole system of hierarchical structures may be constructed,
in which the highest levels are the most general, while adding characteristics
creates sub-groups.

As an example, one may consider the family of mammals, in which more than 4000
species (= sub-groups) are inventoried, such as horses, donkeys, goats, cats,
etc. An interesting point with respect to the definition of a species is that
all members of a species must be able to produce offspring with the other
members of the same species, even after the first generation. This is
evidently the case for all the individuals from the class “horse”. But a horse
and a donkey can also copulate to produce mules. However, horses and donkeys
are not considered to be of the same species, because their offspring, the
mule, is infertile. On the other hand, there are frogs all over the USA, from
the east to the west coast. These frogs can copulate with their neighbours,
and are considered to belong to the same species. However, a frog from the
east coast cannot copulate with one from the west coast: here the species must
be considered as being obtained by a technique related to single linkage
clustering.
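
The frog example can be mimicked by a chaining rule in the spirit of single linkage: two individuals fall in the same group as soon as a chain of sufficiently close neighbours connects them. The sketch below is only an illustration under assumed data; the positions, the threshold and the function name are hypothetical.

```python
def single_linkage_groups(positions, threshold):
    """Group indices so that points within `threshold` of each other share a group,
    directly or through a chain of intermediate points."""
    n = len(positions)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the representative, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if abs(positions[i] - positions[j]) <= threshold:
                parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Populations spread along a line; neighbours at distance 1 are compatible.
positions = [0, 1, 2, 3, 4, 10]
groups = single_linkage_groups(positions, threshold=1)
```

The individuals at positions 0 and 4 end up in the same group although they are four units apart, because each is linked to the next; the isolated individual at position 10 forms a group of its own.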

This shows that even for such natural-seeming groups as zoological species,
not only are the considered characteristics decisive, but the similarity
concept is also involved. Both these elements are instrumental in determining
the structure of a classification, including such problems as the number of
classes in a population.

This paper gives an overview of a number of classification or clustering
techniques (section 2), with special attention to the way of translating
typical features of a classification into the technique of clustering.
Particular emphasis is placed on fuzzy clustering as a method offering better
flexibility, but also suffering from some drawbacks that should be avoided.

The way to determine the number of clusters and the related question of 
natural clusters is considered in the next section (section 3). Not only the 
theoretical frame, but also a few examples are presented. 

Finally, the classical clustering methods are extended by considering a new 
constrained clustering technique (section 4). It is shown how this technique 
can be integrated in existing algorithms, which are hardly disturbed by this 
addition. 

2 Clustering Techniques 

The subject of clustering cannot be treated properly without giving a
definition of the clustering concept. One of the earliest and clearest
definitions was reported by Everitt [Everitt 1974]. It runs as follows:
“grouping the objects into a number of classes, such that objects within
classes are similar in some respect, and unlike those from other classes”.
This definition implicitly supposes that the grouping process is unsupervised,
i.e. that the clusters are not characterised a priori, but only as a result of
grouping a number of objects.

There are two major families of techniques. The first one comprises all
hierarchical techniques, by which the n objects $x_i$ that have to be grouped
are successively partitioned into 1 to n clusters or the other way round.
These techniques are known as divisive and agglomerative respectively. Typical
for these techniques is that, at each level, at least one cluster is
subdivided into two new clusters (divisive method) or two clusters are joined
to form one new cluster (agglomerative method). It should also be noted that
in this family of techniques the problem of the number of clusters does not
arise, as any possible number of clusters is provided.

The other family comprises all non-hierarchical techniques with many sub- 
families. One of these sub-families, the optimization techniques, is further 
considered in this paper. 

The optimization techniques consist of forming clusters, as the result of the 
optimization of an objective function. In general, the optimization is only 
valid for a fixed number of clusters so that the problem of determining the 
optimum number of clusters must be solved independently. 



2.1. One of the oldest forms, developed almost in parallel by many authors,
is the minimum variance or within-group sum of squares method. References
are provided by Forgy [Forgy 1965], Jancey [Jancey 1966], MacQueen [MacQueen
1967] and Ball and Hall [Ball and Hall 1967] and by many others. Later it was
put in a broader context, among others by Bock [Bock 1974] and Späth [Späth
1985]. The objective function may be written as:

F = Ylf^ul (rci - nt)' {xi - Ht) (1) 

t=l i=l 

where $x_i = (x_{i1}, \ldots, x_{ip}, \ldots, x_{im})$ represents the vector of characteristics of
object $i$ and where $\mu_t = (\mu_{t1}, \ldots, \mu_{tp}, \ldots, \mu_{tm})$ represents the centre vector of
cluster $t$, in the space of the characteristics $\{1, \ldots, m\}$ of the objects, with $t$
taking the integer values from 1 to $k$; this latter index corresponds to the
number of clusters considered. The $u_{it}$ are special variables indicating the
membership of object $i$ in cluster $t$ in such a way that

$$u_{it} \ge 0 \quad \text{for all } i \text{ and } t \tag{2}$$

and

$$\sum_{t=1}^{k} u_{it} = 1 \quad \text{for } i \in \{1, \ldots, n\} \tag{3}$$

Furthermore, two possibilities occur. Either the allocation of an object to a
cluster is crisp, in which case the variables $u_{it}$ are boolean, or it is fuzzy, in
which case the $u_{it}$ can take any value in the range $[0, 1]$; in both cases
conditions (2) and (3) must be satisfied.

In the case of the crisp model, the optimization is highly combinatorial.
For each object, the membership of each cluster must be considered in
combination with the allocation of all other objects to all clusters. Different
heuristic algorithms exist to solve this problem, such as the exchange
method. However, the drawback of heuristic algorithms is that one is never sure
that the obtained optimum is the global one, rather than just a locally best
value. Practice shows that with these methods the optimum found is indeed,
more often than not, merely a local one.
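
To give an impression of the combinatorial nature of the crisp problem, the following sketch enumerates all $k^n$ allocations of a very small data set and keeps the one with the smallest within-group sum of squares. The data, the helper names `crisp_objective` and `exhaustive_search`, and the rejection of empty clusters are assumptions made for illustration only.

```python
from itertools import product


def crisp_objective(points, labels, k):
    """Within-group sum of squares for a crisp allocation (labels in 0..k-1)."""
    total = 0.0
    for t in range(k):
        members = [p for p, lab in zip(points, labels) if lab == t]
        if not members:
            return float("inf")  # assumption: allocations with an empty cluster are rejected
        centre = [sum(c) / len(members) for c in zip(*members)]
        total += sum(sum((a - b) ** 2 for a, b in zip(p, centre)) for p in members)
    return total


def exhaustive_search(points, k):
    """Try all k**n allocations; only feasible for a handful of objects."""
    best = min(product(range(k), repeat=len(points)),
               key=lambda labels: crisp_objective(points, labels, k))
    return best, crisp_objective(points, best, k)


# Hypothetical data: two well separated pairs of objects.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
labels, value = exhaustive_search(points, 2)
print(labels, value)
```

Since the number of allocations grows as $k^n$, such an enumeration is out of reach for realistic data sets, which is why heuristics such as the exchange method are used in practice.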

This is less the case with fuzzy optimization, where all variables are con- 
tinuous, and iterative methods, such as steepest descent, or the like, can be 
used [Bezdek, 1981]. 

As all variables are continuous, the analytical expression of the optimal
values for the unknown variables, membership functions and cluster centres,
can be deduced from the zero value of the first derivative of the objective
function at any optimum. Hence, considering first the partial derivative with
respect to $\mu_t$, one gets, after some manipulations,

$$\mu_t = \frac{\sum_{i=1}^{n} u_{it}^{2} \, x_i}{\sum_{i=1}^{n} u_{it}^{2}} \tag{4}$$

provided that for all $t$ at least one $u_{it} > 0$, which can be shown to be generally
the case, at least at the optimum [Trauwaert, 1991].

For the derivative with respect to the membership function $u_{it}$, the objective
function has to be extended to a Lagrangian function, including the
constraints (3). Upon putting its derivative equal to zero and eliminating
the Lagrangian factors, one gets:

$$u_{it} = \frac{1/B_{it}}{\sum_{r=1}^{k} 1/B_{ir}} \tag{5}$$

with

$$B_{it} = (x_i - \mu_t)' (x_i - \mu_t) \tag{6}$$

provided $x_i$ does not coincide with $\mu_t$. If some object $i$ does coincide with one
or more $\mu_t$, its membership functions $u_{it}$ with respect to the corresponding
clusters add up to the value one, whereas the other $u_{it}$ for the same object
$i$, but for different clusters $t$, are all zero.
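
A minimal sketch of the resulting iteration is given below. It assumes that the membership of an object is inversely proportional to its squared Euclidean distance to each centre, and that each centre is the mean of the objects weighted by squared memberships. Sharing the membership equally among coinciding centres, the fixed number of iterations and the name `fuzzy_kmeans` are choices made here for illustration, not part of the method as published.

```python
def fuzzy_kmeans(points, centres, iterations=50):
    """Alternate the membership update and the centre update of fuzzy k-means."""
    k = len(centres)
    u = []
    for _ in range(iterations):
        # Membership update: u_it proportional to 1 / B_it (squared distance).
        u = []
        for x in points:
            b = [sum((a - c) ** 2 for a, c in zip(x, mu)) for mu in centres]
            if min(b) == 0.0:
                # Object coincides with one or more centres: these share the
                # full membership (equal sharing is an assumption), others get 0.
                hits = [1.0 if bt == 0.0 else 0.0 for bt in b]
                u.append([h / sum(hits) for h in hits])
            else:
                inv = [1.0 / bt for bt in b]
                u.append([v / sum(inv) for v in inv])
        # Centre update: mean of the objects weighted by squared memberships.
        centres = []
        for t in range(k):
            w = [row[t] ** 2 for row in u]
            centres.append([sum(wi * x[d] for wi, x in zip(w, points)) / sum(w)
                            for d in range(len(points[0]))])
    return centres, u


# Hypothetical data and starting centres.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
centres, u = fuzzy_kmeans(points, [(0.0, 0.0), (1.0, 1.0)])
print(centres)
```

The returned memberships are those computed in the last pass, just before the final centre update; a convergence test on the change in memberships could replace the fixed iteration count.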

By the definition (1), this method is looking for spherical clusters (or more
generally hyperspherical, if the dimension of the space of the characteristics
is different from three; for simplicity we will omit the prefix “hyper”, as
the context always clearly indicates whether it is required or not). Because
of this feature, groups with different shapes will normally not be found by
this method. This spherical constraint is overcome by the next method.

2.2. The spherical shape results from the fact that, due to (1), distances are
bound to have the same weight in all directions. The parameter space can
be made anisotropic by introducing a matrix $G$ in equation (1), replacing
the unit sphere by a unit ellipsoid:

$$F = \sum_{t=1}^{k} \sum_{i=1}^{n} u_{it}^{2} \, (x_i - \mu_t)' \, G \, (x_i - \mu_t) \tag{7}$$

The optimal values of the elements of the matrix $G$, and hence the shape of
the ellipsoids, may be determined [Späth 1985, Trauwaert et al. 1993] as

$$G = |S|^{1/m} \, S^{-1} \tag{8}$$

with

$$S = \frac{1}{n} \sum_{t=1}^{k} \sum_{i=1}^{n} u_{it}^{2} \, (x_i - \mu_t)(x_i - \mu_t)' \tag{9}$$

Remarkably enough, this has no influence on the expression of
the optimal cluster centres, which remains as in (4), nor on that of the optimal
membership functions (5), provided the expression for $B_{it}$ is adapted as
follows:

$$B_{it} = (x_i - \mu_t)' \, S^{-1} (x_i - \mu_t) \tag{10}$$

It has been shown that by this technique clusters of ellipsoidal shape, all
similar and similarly oriented, can be found [Trauwaert 1991].



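2.3. If, however, one is looking for a specific ellipsoidal shape that can be
different for each cluster, the matrix $G$ also has to be specific for each cluster,
say $G_t$, and (7) becomes [Späth 1985]

$$F = \sum_{t=1}^{k} \sum_{i=1}^{n} u_{it}^{2} \, (x_i - \mu_t)' \, G_t \, (x_i - \mu_t) \tag{11}$$

with

$$G_t = |S_t|^{1/m} \, S_t^{-1} \tag{12}$$

and

$$S_t = \frac{\sum_{i=1}^{n} u_{it}^{2} \, (x_i - \mu_t)(x_i - \mu_t)'}{n_t} \tag{13}$$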


with

$$n_t = \sum_{i=1}^{n} u_{it} \tag{14}$$



Again, the expressions for the centres and the membership functions remain
the same as (4) and (5) respectively, with

$$B_{it} = |S_t|^{1/m} \, (x_i - \mu_t)' \, S_t^{-1} (x_i - \mu_t) \tag{15}$$

By this method, clusters can be found with different shapes and orientations.
However, it appears that the volumes of the clusters (as measured by the
determinants of the scatter matrices $S_t$) must be similar.



2.4. This deficiency can be overcome by using a slightly different objective
function, deduced from a maximum likelihood analysis [Trauwaert et al.
1993]:

$$F = \sum_{t=1}^{k} \sum_{i=1}^{n} u_{it}^{2} \, (x_i - \mu_t)' \, S_t^{-1} (x_i - \mu_t) + \sum_{t=1}^{k} n_t \log |S_t| \tag{16}$$

where the optimal values for $\mu_t$, $S_t$ and $n_t$ remain expressed as before ((4), (13)
and (14)), and where the optimal membership becomes






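$$u_{it} = \frac{1/B_{it}}{\sum_{r=1}^{k} 1/B_{ir}} + \frac{1}{B_{it}} \left[ A_t - \frac{\sum_{r=1}^{k} A_r / B_{ir}}{\sum_{r=1}^{k} 1/B_{ir}} \right] \tag{17}$$

with

$$A_t = \tfrac{1}{2} \{ m - \log |S_t| \} \tag{18}$$

and $B_{it}$ as in (10).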

In this model, the clusters are no longer bound to have similar volumes, 
but there is still a bias towards this feature, and a tendency to find singular 
matrices. 



2.5. These shortcomings are largely eliminated if the objective function is
reduced to the sum of the volumes of the clusters, or of some generalized
squared scattering distance.

The former model is expressed by

$$F = \sum_{t=1}^{k} |S_t|^{1/2} \tag{19}$$

and the latter by

$$F = \sum_{t=1}^{k} |S_t|^{1/m} \tag{20}$$

with $m$ being, as before, the dimension of the space of the characteristics.
In these cases the memberships are similar to (17), with

$$A_t = \frac{1}{2} \, \frac{m}{n_t} \, |S_t|^{\gamma} \tag{21}$$

and

$$B_{it} = |S_t|^{\gamma} \, (x_i - \mu_t)' \, S_t^{-1} (x_i - \mu_t) / n_t \tag{22}$$

with $\gamma$ equalling 1/2 in the former case, and $1/m$ in the latter.

Experience with these programmes indicates that any bias against fairly
unequal clusters has largely disappeared, and certainly that the latter method
appears to partition clusters in a fairly correct way, even when they are
largely different in volume and number of objects [Trauwaert et al. 1993,
Rousseeuw et al. 1996].



3 Number of Clusters 

The question of how many clusters there are in a collection of objects to be 
clustered, is a very fundamental one. Up to now this question was avoided 
in most optimization problems, by supposing that this number was given as 
an extraneous parameter. However, normally this number cannot be given 
without knowing how the objects will be allocated to the clusters; nor can 
the objects be allocated to the clusters without knowing their number. So 
at least some iterations are necessary. 

In all the methods presented up to now, and in fact in all classical
optimization methods, the optimal number of clusters cannot just be deduced from
the minimum value of the objective function. Indeed, all classical objective
functions are monotonically decreasing as a function of the number of clusters.
This is because adding a cluster always makes it possible to reduce the average
distance between objects within the cluster, or the volume of the clusters, or
whatever objective is being used (Fig. 1a).

Acknowledging this fact, many authors have sought a solution by looking
for anomalies in the evolution of this objective function with respect to the
number of clusters [Thorndike 1953, Gower 1967, Hardy 1993], or for the
optimal difference between this function and some other reference function
[Marriott 1971, Calinski and Harabasz 1974, Ratkowsky 1978]. Other
authors have sought an answer by developing significance tests [Beale 1969,
Duda and Hart 1972] or mixture maximum likelihood models [Wolfe 1970];
still others compare the obtained result with that obtained with the
same algorithm, applied to randomly distributed and hence completely
unstructured objects [Hardy 1993] (Fig. 1b). In doing this, the problem was
shifted but not solved: the problem became one of defining and justifying
the reference function or model.

Hence, it clearly appears that a fundamental solution to the problem cannot 
be found in this way, unless the decreasing classical objective function can be 
complemented by a penalty function expressing the disadvantage of having 
too many clusters, or the diminishing advantage of having smaller, but more 
numerous clusters. 

The simplest situation occurs when it may be assumed that there is a
penalty increasing linearly with the number of clusters. Such a situation
is exemplified in a real problem, in which for each cluster some infrastructure
must be built, the total cost being directly proportional to the number
of clusters. Adding this increasing function to the decreasing
objective function of one of the previous methods results in a straightforward
optimum number of clusters (see Fig. 2).

Fig. 1a Evolution of the hypervolume objective as a function of the number of
clusters (Ruspini data)

Fig. 1b Evolution of the hypervolume objective as a function of the number of
clusters (unstructured data)

Of course, the optimum number of clusters now very much depends on the
size of the penalty: with a higher penalty the optimum number will
be reduced and vice versa; but the fact that the method fails to indicate some
“natural” number of clusters should not be interpreted as a criticism of it.
It should rather be considered as a normal and desirable effect. Furthermore,
it should be considered that the optimum also very much depends on the
form of the classical objective function, which in turn varies with the adopted
clustering method. Hence, it is not “natural” either.

Fig. 2 Evolution of the hypervolume objective for Ruspini data with
constant penalty per cluster
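
The effect of a linear penalty can be sketched as follows. The objective values are hypothetical (they are not the Ruspini results), and `optimal_number_of_clusters` is an illustrative name; the point is only that the penalised sum has an interior minimum, which moves towards fewer clusters as the penalty per cluster grows.

```python
def optimal_number_of_clusters(objective_by_k, penalty_per_cluster):
    """Return the k minimising F(k) + penalty_per_cluster * k, and all penalised values."""
    penalised = {k: f + penalty_per_cluster * k for k, f in objective_by_k.items()}
    best_k = min(penalised, key=penalised.get)
    return best_k, penalised


# Hypothetical, monotonically decreasing objective values F(k); not taken from the paper.
objective_by_k = {1: 100.0, 2: 60.0, 3: 35.0, 4: 20.0, 5: 17.0, 6: 16.0}

for penalty in (2.0, 10.0, 30.0):
    best_k, _ = optimal_number_of_clusters(objective_by_k, penalty)
    print(penalty, best_k)
```

With these made-up values the selected number of clusters drops from 5 to 4 to 2 as the penalty rises from 2 to 10 to 30 per cluster.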



4 Constrained Classification 

From all the previous discussions, it should be clear that all clustering meth- 
ods contain some constraints. According to the chosen model underlying the 
method, one will find one or other set of clusters, all satisfying in the best 
possible way their own optimization criterion. 

Hence, as there is no universal method, there will not be any universal
natural set of clusters either. Each method will produce its own clustering scheme,
and these schemes are somehow predetermined to produce specific types of
clusters.

Furthermore, there is no natural number of clusters either, since the optimal
number of clusters must be defined by an additional objective, which can take
many different shapes and, hence, will generally not be defined as natural.
Considering all these constraints - implicit or explicit - one can even take a
further step forward. We have seen that the shape and the orientation of
the different clusters can be defined or not, through the way the covariance
matrices are restricted or not. One can also ask - still within the frame of
the clustering methodology - whether the position of the cluster centres can
be somehow restricted or predefined.

This restriction cannot be such that the centres are completely fixed,
because this reduces the problem of clustering to one of allocating objects
to the nearest fixed centre, which would be a trivial problem (up to now,
cluster centres have always been supposed to be allowed to take any position
in the space of the characteristics of the objects to be clustered). But the
cluster centres could be constrained to remain within a subspace of the
original space of characteristics of the objects. This is an intermediate situation
between completely fixed and entirely free cluster centres.

As before, suppose we have $n$ objects $x_i$ to be grouped in $k$ clusters such
that the cluster centres $\mu_t$ are in a hyperplane defined by:

$$\sum_{h=1}^{m} \alpha_h \, \mu_{th} = \alpha_0 \tag{23}$$

Suppose further that, after comparison of the different possible clustering
methods, a choice is made for fuzzy $k$-means with its objective function
(1). The solution to this optimization problem can be obtained by putting the
derivatives with respect to the unknowns equal to zero, after extension of
the objective function to a Lagrangian function to include constraints (3)
and (23); hence






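$$u_{it} = \frac{1/B_{it}}{\sum_{r=1}^{k} 1/B_{ir}} \tag{24}$$

with $B_{it}$ as in (6), and

$$\mu_{th} = \frac{\sum_{i=1}^{n} u_{it}^{2} \, x_{ih}}{\sum_{i=1}^{n} u_{it}^{2}} + \alpha_h \, \frac{\alpha_0 \sum_{i=1}^{n} u_{it}^{2} - \sum_{j=1}^{m} \alpha_j \sum_{i=1}^{n} u_{it}^{2} \, x_{ij}}{\sum_{j=1}^{m} \alpha_j^{2} \, \sum_{i=1}^{n} u_{it}^{2}} \tag{25}$$

The first term of (25) is as in (4), while the second term results from the
constraint (23) on the position of the centres.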
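
For a single cluster, the constrained centre can be computed as the unconstrained weighted mean shifted along the normal of the hyperplane until the constraint holds. The sketch below assumes squared memberships as weights, as in fuzzy k-means; the names `constrained_centre`, `alpha` and `alpha0` and the data are made up for illustration.

```python
def constrained_centre(points, memberships, alpha, alpha0):
    """Centre of one cluster restricted to the hyperplane sum_h alpha[h] * mu[h] = alpha0."""
    w = [u ** 2 for u in memberships]
    total = sum(w)
    # First term: unconstrained centre, the mean weighted by squared memberships.
    free = [sum(wi * x[h] for wi, x in zip(w, points)) / total
            for h in range(len(alpha))]
    # Second term: shift along the normal vector alpha until the constraint is met.
    shift = (alpha0 - sum(a * c for a, c in zip(alpha, free))) / sum(a * a for a in alpha)
    return [c + a * shift for c, a in zip(free, alpha)]


# Hypothetical cluster of two objects with full membership.
points = [(1.0, 3.0), (3.0, 5.0)]
memberships = [1.0, 1.0]
centre = constrained_centre(points, memberships, alpha=(1.0, 1.0), alpha0=4.0)
print(centre)
```

Here the unconstrained centre (2, 4) is moved to the closest point of the line $\mu_1 + \mu_2 = 4$.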

As the general form of the solution is very similar to that of the uncon- 
strained Fuzzy Isodata model, one can wonder whether the iterative algo- 
rithm will also remain valid. Trials with many examples confirm that this 
is the case. 

It is further believed that this approach will also be valid - mutatis mutandis - 
for all other clustering models and their algorithms. However, these points 
have yet to be checked. 



5 Conclusions 

It has been shown that there is no such thing as “natural” clusters; any
cluster set results from a certain model, and different models produce different
clusters. It is the task of the analyst to make the appropriate choice of
clustering model.

Similarly, the optimal number of clusters in a non-hierarchical approach must 
result from some additional information on the objective function; generally, 
this information can take the form of a penalty function, uniformly increasing 
with the number of clusters. 

Finally it was shown that the cluster centres can also be constrained, and 
that this constraint, at least if it is linear, can easily be treated analytically, 
and be introduced in the iterative algorithm, without major problems. 







References 

BALL G. H., HALL D. J. (1967): A clustering technique for summarizing
multivariate data. Behavioral Sci., Vol. 12, 153-155.

BEALE E. M. L. (1969): Cluster analysis. Scient. Cont. Syst. Ltd., London.

BEZDEK J. C. (1981): Pattern recognition with fuzzy objective function
algorithms. Plenum Press, New York.

BOCK H. H. (1974): Automatische Klassifikation. Vandenhoeck & Ruprecht,
Göttingen.

CALINSKI T., HARABASZ J. (1974): A dendrite method for cluster
analysis. Communications in Statistics, Vol. 3, 1-27.

DUDA R. O., HART P. E. (1972): Pattern classification and scene analysis.
Academic Press, New York.

EVERITT B. (1974): Cluster analysis. Halsted Press, New York.

FORGY E. W. (1965): Cluster analysis of multivariate data: efficiency
versus interpretability of classification. Biometrics, Vol. 21, 768-769.

GOWER J. C. (1967): A comparison of some methods of cluster analysis.
Biometrics, Vol. 23, 623-637.

HARDY A. (1993): An examination of procedures for determining the
number of clusters in a data set, in Diday E. et al. (Eds.) New approaches in
classification and data analysis. Springer Verlag, Berlin, 178-185.

JANCEY R. C. (1966): Multidimensional group analysis. Aust. J. Bot., Vol. 14,
127-130.

MACQUEEN J. (1967): Some methods for classification and analysis of
multivariate observations. Proc. 5th Berkeley Symp., Vol. 1, 281-297.

MARRIOTT F. H. C. (1971): Practical problems in a method of cluster
analysis. Biometrics, Vol. 27, 501-514.

RATKOWSKY D. A., LANCE G. N. (1978): A criterion for determining
the number of groups in a classification. The Australian Computer J., Vol.
10, 115-117.

ROUSSEEUW P. J., KAUFMAN L., TRAUWAERT E. (1996): Fuzzy
clustering using scatter matrices. Computational Statistics & Data Analysis,
Vol. 23, 135-151.

SPÄTH H. (1985): Cluster dissection and analysis. Theory, FORTRAN
programs, examples. Halsted Press, New York.

THORNDIKE R. L. (1953): Who belongs in the family? Psychometrika, Vol. 18,
267-276.







TRAUWAERT E. (1991): Grouping of objects using objective functions
with applications to the monitoring and control of industrial production
processes. Ph.D. thesis, VUB, Brussels.

TRAUWAERT E., ROUSSEEUW P. J., KAUFMAN L. (1993): Fuzzy
clustering by minimizing the total hypervolume, in Opitz O. et al. (Eds.)
Information and Classification, Proc. 16th annual conf. of the GfKl. Springer,
Berlin.

TRYON R. C. (1939): Cluster analysis. Edwards Brothers, Ann Arbor.

WOLFE J. H. (1970): Pattern clustering by multivariate mixture analysis.
Multivariate Behavioral Research, Vol. 5, 329-350.




PLENARY AND SEMI PLENARY 

PRESENTATIONS 



Finance and Risk 




From Variance to Value at Risk: A Unified
Perspective on Standardized Risk Measures

Hans Wolfgang Brachinger

Seminar für Statistik, Wirtschafts- und Sozialwissenschaftliche Fakultät,
Universität Fribourg, CH-1700 Fribourg, Switzerland



Abstract: Risk is a concept which matters to many issues in economics and
finance. The range of risk measures proposed goes from classics like variance to
modern approaches like Value-at-Risk (VaR). In this paper, after a short
characterization of managers’ intuitive notion of risk, an overview is given of those
risk measures which try to measure risk in a standardized way independent of
individually varying perception. Then, it is shown that all these measures, including
Value-at-Risk, basically are special cases of a certain well-known family of risk
measures. From this point of view, the most critical features of each measure,
particularly of VaR, become immediately evident.



1 Introduction 

The concept of risk plays an important role in much of the current writings 
on economic and financial issues. Intuitively, risk is a kind of negative feature 
characterizing a decision alternative, e.g., a certain portfolio. Risk is meant 
to be a chance of injury or loss connected with a given action. 

In general, risk is not an objective feature of a decision alternative. It is an 
inherently subjective construct because what is considered a loss and what 
its significance and its chance of occurring are, is peculiar to the person con- 
cerned. Nevertheless, in the economic, especially in the banking literature 
there are various attempts to measure risk in a standardized way indepen- 
dent of individually varying perception. This is necessary for regulatory as 
well as managing purposes. Thereby, the main emphasis lies on the risk 
itself of an alternative, independently of the problem of risk preference and 
of individually varying perception. 

In this paper, first, the intuitive notion of risk is characterized. Then, an 
overview is given of those measures of risk which have been advanced in 
economics and finance to quantify risk in a standardized way. For a more 
comprehensive survey of measures of risk including recently developed eco- 
nomic or psychological theories of subjectively perceived risk see Brachinger 
and Weber (1997). In the third section, it is shown that all of the risk mea- 
sures reviewed, including the modern Value-at-Risk methodology, basically, 
are special cases of the well-known family of risk measures introduced by 
Stone (1973). From this point of view, the most critical features of each
measure, particularly of VaR, will become immediately evident.

2 Managerial Notion of Risk

The concept of risk has so permeated the economics community that no one
needs to be convinced of its importance. What remains controversial is what
constitutes risk and how it should be measured. The German Duden (1980), p. 2168,
defines “Risiko” as “möglicher negativer Ausgang bei einer Unternehmung,
womit Nachteile, Verlust und Schaden verbunden sind” (a possible negative
outcome of an undertaking, entailing disadvantages, loss and damage). The Concise
Oxford Dictionary of Current English (Sykes (1976)) paraphrases the notion of
“risk” as “exposure to chance of injury or loss”.

In economics there has been a long tradition of avoiding direct questioning
of people. Only in the last two decades have some empirical studies of
managerial notions of risk been conducted. These studies provide some rather
consistent observations on how managers conceive risk.

Mao (1970) asked business executives of medium and large companies in
different industries what they understood by the term “investment risk”.
Typical statements were, e.g., the following:

• “Risk is the prospect of not meeting the target rate of return. ... If you
are one hundred percent sure of making the target return, then it is a
zero risk proposition.”

• “Risk ... is primarily concerned with downside deviations from the
target return. However, if there is a good chance of coming out better than
your forecast, that is a negative risk (a sweetener) which is taken into
account in determining the security of an investment.”

• “... the risk of an investment: the chances of losses exceeding a certain
percent of my total equity ... .”

• “... Also, I never worry about the project return going above the target
return. Risk is what might happen when the return is going to be less.”

As one major result of his study, Mao points out that these statements imply 
that “risk is primarily considered to be the prospect of not meeting some 
target rate of return.” 

Hertz and Thomas (1983), p. 10, emphasize that, in empirical studies, “two
scales are typically identified in relation to a given risk: i.e. a severity grading 
(amount of potential loss) and a frequency or probability grading (likelihood 
of occurrence of loss)...”. For substantiation they refer to an empirical work 
done by Rothkopf (1975): “In his dealings with businessmen he suggests that 
they use the word ‘risk’ in a manner which implies that the risk of a venture 
increases if the likelihood of loss increases, or if the magnitude of possible 
loss increases.” 

Petty and Scott (1981) distributed a questionnaire among the heads of the
finance departments of the firms on the May 1977 “Fortune 500” list. In
this questionnaire, among other things, people were invited to explain what
they understood by a “risky investment”. At a response rate of 35%, in the
vast majority of responses risk was paraphrased in the sense of likelihood of not
meeting a certain minimum return (cf. Payne, Laughhunn and Crum (1980),
p. 1041).

In an extensive study of managerial perceptions of risk, MacCrimmon and
Wehrung (1986) questioned over 500 top-level business executives. In one
of their questionnaires MacCrimmon and Wehrung asked these executives:
“What do you mean when you describe a business situation as risky? What
are the important characteristics of a risky situation?” (cf. MacCrimmon
and Wehrung (1986), p. 305). The following responses were typical
(cf. MacCrimmon and Wehrung (1986), p. 18 f.):

• “There is a high degree of loss in undertaking the situation.”

• “High probability of failure due to known threats and weaknesses which
are not offset by commensurable rewards.”

Summing up, MacCrimmon and Wehrung (1986) characterize the
managerial notion of risk by the components “Exposure to Potential Loss”,
“Magnitude of Potential Loss” and “Chances of Potential Loss” (cf. MacCrimmon
and Wehrung (1986), p. 9 as well as Table 1.1, p. 19).

In empirical studies, typically, two dimensions which appear to determine
risk have been identified: amount of potential loss and probability of
occurrence of loss. The risk of an alternative increases if the probability of loss
increases or if the amount of potential loss increases. Furthermore, there is
empirical evidence that possible gains may reduce the risk of an alternative.
Thereby, losses and gains are defined with reference to a certain target
outcome. This target outcome may be the zero outcome, the status quo, a certain
aspiration level, or the best result attainable in a certain situation.
An outcome is regarded as a loss if and only if it falls below the target
outcome. It is regarded as a gain if and only if it lies above the target outcome.
Obviously, indicators which try to measure risk in a standardized way should
be consistent with these observations on how managers conceive risk.



3 Standardized Risk Measures 

In the economic, especially the finance literature, risk is most commonly
conceived as reflecting variation in the distribution of possible outcomes, their
likelihoods, and their subjective values. Risk is treated within some theory
of decision making under risk as being one significant aspect of the available
options. Some of these theories treat risk as a completely independent
concept and make explicit use of a risk measure, others do not. Within the
framework of the Expected Utility Model, e.g., a single alternative’s risk is
not quantified; only an individual’s general attitude towards risk is reflected
by the shape of his or her utility function and, given the utility function,
quantified by the well-known Arrow-Pratt measure. Explicit use of a measure
of risk is made in risk-value models (for an overview of such models see
Sarin and Weber (1993)).

As mentioned above, in general, risk is not an objective feature of a decision
alternative but rather an inherently subjective construct. In this section,
an overview is given of measures of risk which have been advanced in
economics and finance to quantify risk in a standardized way which is widely
acceptable and independent of individually varying perception. In none of
the measures reviewed is a subjective transformation of values or probabilities
admitted.

Within risk-value models, traditionally, the risk of an option has primarily
been associated with the dispersion of the corresponding variable. At the
latest since Markowitz’s (1959) and Tobin’s (1958) pioneering work on
portfolio selection, it is common to measure the riskiness of a portfolio by the
variance $\sigma^2$ or the standard deviation $\sigma$ of its potential outcomes.

Let a portfolio’s future value or wealth be characterized by a continuous
random variable $\tilde{w}$ with distribution function $F_{\tilde{w}}$ and probability density
function $f_{\tilde{w}}$. Then, with the mathematical expectation

$$\mu := E(\tilde{w}) := \int_{-\infty}^{+\infty} w \, f_{\tilde{w}}(w) \, dw , \tag{1}$$

these risk measures are defined by

$$\sigma^2 := \operatorname{var}(\tilde{w}) := \int_{-\infty}^{+\infty} (w - \mu)^2 \, f_{\tilde{w}}(w) \, dw \tag{2}$$

and

$$\sigma := \left[ \int_{-\infty}^{+\infty} (w - \mu)^2 \, f_{\tilde{w}}(w) \, dw \right]^{1/2} . \tag{3}$$

In the finance context the standard deviation usually is called volatility.
Markowitz himself (1959), pp. 286-297, already discussed similar
standardized risk measures. Among these are the expected absolute deviation around
$\mu$

$$\int_{-\infty}^{+\infty} | w - \mu | \, f_{\tilde{w}}(w) \, dw \tag{4}$$

and the expected absolute deviation around 0

$$\int_{-\infty}^{+\infty} | w | \, f_{\tilde{w}}(w) \, dw . \tag{5}$$



Besides, it has been conventional wisdom in economics and other fields of
research that risk is the chance of something bad happening. In their famous
monograph, e.g., von Neumann and Morgenstern (1947) used the
word risk explicitly only once. In a footnote, they defined risk as “the worst that
can happen under given conditions”. In this vein, risk is associated with an
absolutely ‘bad’ outcome or an outcome that is worse than some specific
target outcome and its probability. Among the risk measures tailored to this
notion of risk are the lower semivariance

$$\int_{-\infty}^{\mu} (w - \mu)^2 \, f_{\tilde{w}}(w) \, dw , \tag{6}$$

the expected value of loss

$$\int_{-\infty}^{0} | w | \, f_{\tilde{w}}(w) \, dw , \tag{7}$$

and the probability of loss or probability of ruin

$$P_{\tilde{w}}(\tilde{w} \le r) = \int_{-\infty}^{r} f_{\tilde{w}}(w) \, dw . \tag{8}$$

Thereby, $r$ is a certain target level, outcomes below which are a loss or
disastrous to the decision maker.

In the same vein, Fishburn (1977) proposed the risk measure

$$R_F(\tilde{w}) = \int_{-\infty}^{t} (t - w)^{k} \, f_{\tilde{w}}(w) \, dw \qquad (k > 0) . \tag{9}$$

Thereby, $t$ is a fixed upper bound, $t \le E(\tilde{w})$. The parameter $k$ of this risk
measure may be interpreted as a risk parameter characterizing a kind of risk
attitude. Values $k > 1$ describe a certain risk-sensitive, values $k \in (0, 1)$ a
certain risk-insensitive behavior.
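
For a finite sample of equally likely outcomes, Fishburn's measure can be approximated by an average over the outcomes below the target. The sketch below makes that assumption; the data and the function name `lower_partial_moment` are illustrative.

```python
def lower_partial_moment(outcomes, t, k):
    """Sample analogue of Fishburn's measure: mean of (t - w)**k over outcomes below t."""
    return sum((t - w) ** k for w in outcomes if w < t) / len(outcomes)


outcomes = [-2.0, -1.0, 0.0, 1.0, 2.0, 6.0]  # hypothetical, equally likely outcomes
target = 0.0

lpm_1 = lower_partial_moment(outcomes, target, 1)  # k = 1
lpm_2 = lower_partial_moment(outcomes, target, 2)  # k = 2 weights large shortfalls more
print(lpm_1, lpm_2)
```

Raising $k$ from 1 to 2 increases the weight of the largest shortfall relative to the smaller one, which is the sense in which $k > 1$ reflects risk-sensitive behavior.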

Fishburn’s risk measure can be interpreted as a certain moment of the
distribution of $\tilde{w}$. Like the lower semivariance, it is a ‘lower moment’ characterizing
the part of the distribution below the expectation. It is ‘partial’ because this
part is only partially characterized. Because this characterization is
relative to the parameter $t$, Fishburn’s risk measure simply constitutes what in
the literature now is called the lower partial moment (relative to $t$) of the
distribution of $\tilde{w}$.

To measure the market risk of a portfolio of traded assets, banks are
increasingly employing internal models based on a methodology called
Value-at-Risk. This methodology serves for the determination of the capital
requirements that banks have to fulfill in order to back their trading activities.
For a given time horizon and a confidence level $1 - \alpha$, the VaR of a portfolio
is the loss in market value over the time horizon that is exceeded by the
portfolio only with probability $\alpha$.

Let $r$ be the reference level with which the value of a given portfolio is
compared at the end of the time horizon. If $\tilde{w} < r$, there is a loss in the
amount of $r - \tilde{w}$. The portfolio’s loss is thus given by the random variable

$$\tilde{l} := r - \tilde{w} . \tag{10}$$

As reference level, initial wealth $w_0$ as well as expected wealth $E(\tilde{w})$ may
reasonably be used. The probability of a loss lower than or equal to $l$ is given
by the distribution function

$$F_{\tilde{l}}(l) := P(\tilde{l} \le l) = \int_{-\infty}^{l} f_{\tilde{l}}(t) \, dt . \tag{11}$$

Using the loss distribution $F_{\tilde{l}}$, for a given time horizon and a given confidence
level $1 - \alpha$ ($0 < \alpha < 1$, e.g., $\alpha = 0.01$), the $(1 - \alpha) \cdot 100\%$ Value-at-Risk of the
portfolio is the loss $\mathrm{VaR} = \mathrm{VaR}(\tilde{w})$ implicitly defined by

$$F_{\tilde{l}}(\mathrm{VaR}) = P(\tilde{l} \le \mathrm{VaR}) = 1 - \alpha . \tag{12}$$

This equation shows that, statistically speaking, the VaR measure of a
portfolio is the $(1 - \alpha) \cdot 100\%$ quantile of the portfolio’s loss distribution.
Applying the inverse distribution function $F_{\tilde{l}}^{-1}$ to (12) yields the $(1 - \alpha) \cdot 100\%$
Value-at-Risk of the portfolio explicitly through

$$\mathrm{VaR} = \mathrm{VaR}(\tilde{w}) = F_{\tilde{l}}^{-1}(1 - \alpha) . \tag{13}$$

Thereby, $F_{\tilde{l}}^{-1}(1 - \alpha)$ is the value of the inverse distribution function $F_{\tilde{l}}^{-1}$ at
$1 - \alpha$.
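
With a finite sample of portfolio values in place of the continuous distribution, the Value-at-Risk can be sketched as an empirical quantile of the losses. The sample, the choice of initial wealth as reference level, and the quantile convention (the smallest loss whose empirical distribution function reaches $1 - \alpha$) are assumptions of this example; `value_at_risk` is an illustrative name.

```python
import math


def value_at_risk(values, reference, alpha):
    """Empirical (1 - alpha) quantile of the loss l = reference - w."""
    losses = sorted(reference - w for w in values)
    # Smallest loss whose empirical distribution function reaches 1 - alpha.
    rank = math.ceil((1.0 - alpha) * len(losses))
    return losses[max(rank, 1) - 1]


# Hypothetical end-of-horizon portfolio values; reference level = initial wealth 100.
values = [90.0, 95.0, 98.0, 100.0, 101.0, 102.0, 103.0, 104.0, 105.0, 110.0]
var_90 = value_at_risk(values, reference=100.0, alpha=0.10)
print(var_90)
```

In this sample only one of the ten losses exceeds the reported value, in line with the interpretation of VaR as the loss exceeded only with probability $\alpha$.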



4 Unified Perspective 

As early as the beginning of the seventies, Stone (1973) introduced two
related three-parameter families of risk measures. His main goal was to show
how the most common risk measures are related. The first three-parameter
risk measure is defined as

$$R_{S1}(\tilde{w}) := \int_{-\infty}^{q(F_{\tilde{w}})} | w - p(F_{\tilde{w}}) |^{k} \, dF_{\tilde{w}}(w) \qquad (k \ge 0) , \tag{14}$$

where $p = p(F_{\tilde{w}})$ denotes a reference level of wealth from which deviations
are measured. The number $k$ specifies a power to which deviations
in wealth from the reference level are raised, and thus $k$ is a measure of the
relative impact of large and small deviations. The parameter $q = q(F_{\tilde{w}})$ is
a range parameter that specifies what deviations are to be included in the
risk measure. The second three-parameter risk measure is defined to be the
$k$-th root of $R_{S1}(\tilde{w})$, i.e.,

$$R_{S2}(\tilde{w}) := \left[ \int_{-\infty}^{q(F_{\tilde{w}})} | w - p(F_{\tilde{w}}) |^{k} \, dF_{\tilde{w}}(w) \right]^{1/k} \qquad (k > 0) . \tag{15}$$

Through appropriate choices of the parameters $p = p(F_w)$, $q = q(F_w)$, and
$k$ it is easy to see that the risk measures listed above are special cases of
one of Stone's families. The variance (2) results from equation (14) and the
standard deviation (3) from equation (15) by setting $p = E(w)$, $k = 2$, and
$q = +\infty$. The expected absolute deviation around $\mu$ and around 0, (4) and
(5) respectively, are special cases of (14) obtained by choosing $p = \mu = E(w)$
and $p = 0$, respectively, $k = 1$, and $q = +\infty$. Equation (14) gives the lower
semivariance (6) when $k = 2$ and $p(F_w) = q(F_w) = \mu$; it gives the expected
value of loss (7) when $p = q = 0$ and $k = 1$. Stone's family (14) amounts to
the probability of loss (8) by setting $k = 0$ and $q = r$. Finally, this family
yields the lower partial moments (9) by setting $p = q = t$.
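The parameter choices above can be tried out on a small sample. The sketch below is an empirical version of the two families for equally likely outcomes; the function names `stone_rs1` and `stone_rs2`, the sample values, and the convention that the range of integration includes the point $q$ are assumptions made for illustration.

```python
from statistics import fmean, pvariance


def stone_rs1(wealth, p, q, k):
    """Empirical first family: average of |w - p|**k over outcomes w <= q.

    Outcomes above q contribute zero, so the sum is divided by the full
    sample size (each outcome has probability 1/n).
    """
    n = len(wealth)
    return sum(abs(w - p) ** k for w in wealth if w <= q) / n


def stone_rs2(wealth, p, q, k):
    """Empirical second family: the k-th root of the first family."""
    return stone_rs1(wealth, p, q, k) ** (1.0 / k)


# Hypothetical end-of-period wealth outcomes, all equally likely.
wealth = [90.0, 95.0, 100.0, 105.0, 120.0]
mu = fmean(wealth)

variance = stone_rs1(wealth, p=mu, q=float("inf"), k=2)
std_dev = stone_rs2(wealth, p=mu, q=float("inf"), k=2)
semivariance = stone_rs1(wealth, p=mu, q=mu, k=2)
prob_at_most_100 = stone_rs1(wealth, p=100.0, q=100.0, k=0)
```

With $k = 0$ every included outcome counts as 1, so the measure reduces to the probability mass at or below $q$; with $q = +\infty$ and $p$ equal to the mean, the first family reproduces the population variance.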

For any triplet (p, q, k) of parameter values, through both of Stone’s fa- 
milies of risk measures the risk of a given portfolio is characterized by a 
nonnegative number R = R{p,q,k). In fact, through both of these families 
of risk measures, the risk of a given portfolio is characterized by a quadru- 
plet (p, q, k, R) where R = R{p, q, k). Essentially, if any three of these four 
quantities are fixed the fourth quantity can be used as a (not necessarily 
nonnegative) real-valued indicator of the risk of a given portfolio. On the
basis of that idea, it can additionally be shown that the VaR measure is also
a special case of Stone's family (14).

Equivalently to definition (12), the Value-at-Risk of a portfolio can be defined
on the basis of the distribution of the value variable $w$ instead of the
distribution of the loss variable $l$. According to equation (10), value results
from loss through

\[ w = r - l. \qquad (16) \]

Under this transformation, the loss VaR turns into the value $r - \mathrm{VaR}$, and
for the value distribution function $F_w$ it holds that

\[ F_w(r - \mathrm{VaR}) = 1 - F_l(r - (r - \mathrm{VaR})) = 1 - F_l(\mathrm{VaR}) = \alpha. \qquad (17) \]

From this equation it can be seen that $r - \mathrm{VaR}$ is the $\alpha \cdot 100\%$-quantile of
the portfolio's value distribution. In other words, the Value-at-Risk measure
can also be defined by $r - w(\alpha)$, where $w(\alpha)$ is the $\alpha \cdot 100\%$-quantile of the
portfolio's value distribution.

Furthermore, in the light of Stone's family (14), equation (17) shows that
the Value-at-Risk of a portfolio is implicitly defined by

\[ \alpha = F_w(r - \mathrm{VaR}) = \int_{-\infty}^{r - \mathrm{VaR}} | w - r |^0 \, f_w(w) \, dw. \qquad (18) \]







From this equation one immediately recognizes that the Value-at-Risk
measure is also a special case of Stone's family (14). Value-at-Risk results from
(14) if one chooses $p = r$, $q = r - \mathrm{VaR}$, and $k = 0$. Through this risk measure
the risk of a portfolio is characterized by the pair $(q; P(w \le q)) =: (r - \mathrm{VaR}; \alpha)$.
Each component is a function of the other. For comparing
different portfolios with regard to their risk, either of the two components could
be used after having fixed the value of the other. In banking practice,
$\alpha$ is usually fixed and $\mathrm{VaR} = \mathrm{VaR}(\alpha)$ is then used to characterize the risk
of a portfolio.

Thus it has been shown that all the risk measures reviewed in Section 3 are
special cases of one of Stone's families. For any risk measure, this embedding
immediately discloses the features of this measure. It shows, e.g., that the
variance indeed takes into account the idea of a target return, namely by
choosing $p = E(w)$, but that, by choosing $q = +\infty$, all deviations from
that target return, irrespective of being above or below it, are taken into
account symmetrically. Outcomes above the target return also
increase the risk. This contradicts the managerial notion of risk outlined in
Section 2.

This embedding also discloses the major features of the VaR measure. It 
takes into account the idea of a target return by implicitly choosing p = r. 
Contrary to the variance and in the sense of the managerial notion of risk, 
only deviations from the target return downwards are considered. Another 
advantage of the VaR measure is that by fixing the parameter $\alpha$ the risk of a
portfolio is expressed in terms of value and is, therefore, easy to interpret.
But, nevertheless, the VaR is a very rudimentary risk measure. Because the 
parameter k is set to 0, obviously, it contains no information on the loss 
distribution. The VaR user knows that a loss bigger than the VaR will only 
happen with a certain (small) probability. He has no information on, e.g., 
how large a very big loss can be and how probable it is. Contrary to the 
managerial notion of risk, the VaR measure does not increase if the amount 
of potential loss increases. 
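The last point can be made concrete with two artificial portfolios. In the sketch below the scenario losses, the 95% level, and the helper names `empirical_var` and `mean_excess_loss` are all hypothetical; the empirical quantile is taken as the smallest loss whose empirical distribution function reaches $1 - \alpha$.

```python
import math


def empirical_var(losses, alpha):
    """Smallest loss whose empirical distribution function is >= 1 - alpha."""
    ordered = sorted(losses)
    n = len(ordered)
    # Small tolerance guards against floating-point noise in n * (1 - alpha).
    rank = math.ceil(n * (1.0 - alpha) - 1e-9)
    return ordered[max(rank, 1) - 1]


def mean_excess_loss(losses, threshold):
    """Average of the losses strictly above the threshold (0.0 if none)."""
    tail = [x for x in losses if x > threshold]
    return sum(tail) / len(tail) if tail else 0.0


# Two hypothetical portfolios with 100 equally likely scenarios each.
# They differ only in the size of the five worst losses.
portfolio_a = [1.0] * 95 + [10.0] * 5
portfolio_b = [1.0] * 95 + [1000.0] * 5

var_a = empirical_var(portfolio_a, alpha=0.05)
var_b = empirical_var(portfolio_b, alpha=0.05)
tail_a = mean_excess_loss(portfolio_a, var_a)
tail_b = mean_excess_loss(portfolio_b, var_b)
```

Both portfolios receive the same 95% Value-at-Risk, although the losses beyond that level differ by a factor of one hundred.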

5 Final Remarks 

The embedding of the well-known standardized risk measures in Stone’s 
families points out that three issues are to be considered in specifying a risk 
measure. The first is from what target level deviations are to be measured.
The second is what the relative importance of large deviations compared to
small ones is. The third is which of the deviations are to be counted. These
three issues lead to the selection of a certain parameter vector $(p, q, k)$ in
Stone's families. There are many different possibilities to select such vectors.
In any case, managers have little inclination to equate a portfolio’s risk 
with its variance (cf., e.g., March and Shapira (1987)). In the light of the 
managerial notion of risk, the modern Value-at-Risk approach is much more 
adequate. But Value-at-Risk is a very rudimentary risk measure. It is just a 
first step in the right direction.



Abstract: Does the composition of the optimal portfolio depend on the planning
horizon? According to popular opinion there exists a planning horizon effect if 
initial wealth has to be allocated between shares and the risk-free asset: the per- 
centage invested into shares should increase if the planning horizon is extended. 
The paper reviews the theoretical underpinnings of the statement. In the frame- 
work of expected utility the findings are mixed. Some results can be derived 
which contradict the popular opinion. But one can also find results which sup- 
port the popular opinion. The conclusions depend on the class of utility functions 
under consideration and on the alternatives to be compared. However, from the 
analysis of shortfall models strong evidence in favor of the popular opinion can 
be inferred. In addition, the optimal percentage invested in the stock market can 
easily be quantified. The well-known shortfall criteria of Roy, Kataoka, and Telser 
are studied in some detail. 



1 Introduction 

Two-fund separation suggests the analysis of portfolios which are structured
as follows: A fraction $w$ of initial wealth $v_0$ is to be invested into the stock
market whereas the rest, i.e. the amount $(1 - w)v_0$, is to be invested at the
risk-free rate $r$. Final wealth at planning horizon $T$ stemming from such
portfolios is

\[ V_T = v_0 w e^{R(T)} + v_0 (1 - w) e^{rT}, \qquad (1) \]

where both the (annualized) risk-free rate $r$ and the stochastic rate $R(T)$
are compounded continuously. These portfolios are not only suggested by
two-fund separation but offer the following advantages:



• They are very easy to implement. Investment into the stock market 
could be realized by purchasing index certificates or index funds, in 
Germany for instance DAX-Participations, DAX-Cititraks etc. 



• They provide broad diversification at low cost.

• No reallocation from now ($t = 0$) until the planning horizon $T$ is
required. Therefore, the investor does not need to incur additional
transaction costs.



To make the last point sufficiently clear: The portfolio made up at $t = 0$ is
“frozen in”. Only $w$, the initial fraction invested into shares, is subject to
optimization. Note that the riskily invested fraction increases with time if the
stock market outperforms the risk-free rate. Therefore, we do not consider 
“constant proportion portfolios” which require permanent readjustments to 
keep the riskily invested fraction constant through time. And of course we do 
not deal with Ramsey-type models, studied, for instance, by Phelps (1962) 
or Samuelson (1969). Therefore we need not bother about ingredients like 
intertemporal consumption and investment decisions, the specification of a 
subjective time preference rate, the bequest function at the planning horizon 
etc. 

Classical (= Markowitz) portfolio selection gives no hints about time horizon 
effects. The single risky assets and the portfolios have rates of return whose 
expected value and variance are proportional to time. Therefore, planning 
horizon aspects cancel out. The typical one-period portfolio selection model 
is applicable to periods of every given length. But there is one remarkable 
exception. Burkhardt (1997a, 1997b) pursues a kind of dual approach in 
which time aspects are most important. He starts with a given target (final 
wealth) and considers the time to achieve the target as the relevant random 
variable and the unique source of risk. The portfolio composition should 
entail a “most favorable” time distribution. Since risk is measured in time 
(instead of end-of-period wealth) new types of preferences are required. 
For details, similarities and differences to the classical approach reference is 
made to the above mentioned papers. 

The popular opinion quoted in the Abstract can be spelled out in the following
way: “The optimal fraction $w$ is an increasing function of $T$.” Without
doubt there is overwhelming empirical evidence in favor of it. Therefore,
the opinion is widespread among people who are well-informed about capital
markets. However, the majority of German people does not seem to believe
in this statement since they prefer to invest their money either risk-free or
at low risk, even if the planning horizon is rather long. Let $w(T)$ denote
the optimal fraction for given planning horizon $T$. In order to give the term
“optimal $w(T)$” a proper meaning we have to make assumptions about share
price dynamics (Section 2), and to define a framework for decisions under
risk. Section 3 addresses the standard case, the expected utility framework.
A nonstandard framework, which currently meets a good response in theory
and practice,¹ is the subject of Section 4. Several shortfall models will be
studied in some detail.



¹ See Albrecht (1994), Kaduff and Spremann (1996), Schubert (1996), Tse et al. (1993),
Wolter (1993), Zimmermann (1991, 1993).

2 Share Price Dynamics

The assumptions concerning the stock price (index) $S(t)$ are exactly those
of the Black-Scholes model. Therefore, we only have to clarify the notation
and the magnitude of parameters.

$S(t) = S(0) e^{R(t)}$ is a geometric Brownian motion with drift $\mu$,

$R(t)$ is an arithmetic Brownian motion with drift $\mu - \sigma^2/2$,

$R(t)$ is the continuously compounded rate of return of the
stock market and $\sigma$ denotes volatility.

In the sequel we only need the marginal distributions with respect to S{t) 
and R{t). 

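For fixed $t$ it holds:

\[ S(t) \text{ is lognormal with } E(S(t)) = S(0) e^{\mu t}, \qquad (2) \]

\[ R(t) \text{ is normal with } E(R(t)) = \left(\mu - \tfrac{\sigma^2}{2}\right) t \text{ and } \mathrm{Var}(R(t)) = \sigma^2 t. \qquad (3) \]

And final wealth, defined in (1) as

\[ V_T = v_0 w e^{R(T)} + v_0 (1 - w) e^{rT}, \]

is shifted lognormal.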
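The distributional statements can be checked by simulation. The following sketch draws $R(T)$ from the normal distribution in (3) and compares the simulated mean of final wealth with the closed form implied by (2). The parameter values, the sample size, the seed, and the function name `simulate_final_wealth` are assumptions chosen for illustration.

```python
import math
import random

MU, SIGMA, R_FREE = 0.12, 0.20, 0.03  # illustrative parameter values


def simulate_final_wealth(v0, w, horizon, n_paths, seed=0):
    """Draw V_T = v0*w*exp(R(T)) + v0*(1-w)*exp(r*T) with R(T) normal."""
    rng = random.Random(seed)
    mean = (MU - 0.5 * SIGMA**2) * horizon
    std = SIGMA * math.sqrt(horizon)
    safe_part = v0 * (1.0 - w) * math.exp(R_FREE * horizon)
    return [v0 * w * math.exp(rng.gauss(mean, std)) + safe_part
            for _ in range(n_paths)]


v0, w, horizon = 100.0, 0.6, 5.0
sample = simulate_final_wealth(v0, w, horizon, n_paths=100_000)
simulated_mean = sum(sample) / len(sample)

# E(exp(R(T))) = exp(mu*T), so E(V_T) has a closed form.
exact_mean = v0 * (w * math.exp(MU * horizon)
                   + (1.0 - w) * math.exp(R_FREE * horizon))
# The risk-free part is a lower bound for V_T: the "shift" of the lognormal.
floor = v0 * (1.0 - w) * math.exp(R_FREE * horizon)
```

Every simulated outcome lies above `floor`, which is why the distribution is a lognormal shifted by the risk-free part of the portfolio.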

To get a crude (annualized) estimate $\hat{\mu}$ we use (2) by setting

S(t) = DAX_t
t = April 30, 1998
0 = ultimo 1987

and equating $E[S(t)]$ with realized $S(t)$. We obtain

\[ 5241 = 1000 \cdot e^{10.33\,\hat{\mu}} \quad \text{or} \quad \hat{\mu} = \frac{1}{10.33} \ln 5.241 \approx 16\%. \]
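The arithmetic behind this crude estimate is short enough to spell out. In the sketch below the index levels are those quoted in the text; the time span of 10.33 years is an assumption derived from the two dates (roughly ten years and four months), and `annualized_drift` is a hypothetical helper name.

```python
import math


def annualized_drift(start_level, end_level, years):
    """Solve end = start * exp(mu * years) for mu."""
    return math.log(end_level / start_level) / years


# DAX levels as quoted in the text (ultimo 1987 to April 30, 1998).
mu_hat = annualized_drift(start_level=1000.0, end_level=5241.0, years=10.33)
print(f"estimated drift: {mu_hat:.1%}")
```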

With respect to volatility we start out from the typical estimate $\sigma = 20\%$.
In order to prevent a bias in favor of the stock market we use

\[ \mu = 12\%, \quad \sigma = 20\%, \quad r = 3\% \qquad (4) \]

for most of the following examples. In particular this yields²

$\mu - \sigma^2/2 = 0.12 - 0.04/2 = 10\%$ for the drift (= expected one-year rate of
return) of $R(t)$,

$\mu - \sigma^2/2 - r = 7\%$ for the (expected) excess return.

² See for instance Stehle et al. (1996).

With respect to the last number (7%) only the positive sign is essential; the 
conclusions drawn in the next two sections do not depend on the (positive) 
numerical value. The fact that the excess return is positive is backed up 
empirically and results from the risk aversion of investors. 



3 Expected Utility 

The fraction $w$ invested into the stock market defines a one-parametric set of
portfolios. Aside from $w = 0$ and $w = 1$, final wealth (1) is shifted lognormal.
The two corner portfolios

w = 0 : invest only risk-free
w = 1 : invest only into the stock market

lead to distributions which are easier to handle, since $w = 0$ entails a
deterministic and $w = 1$ a (nonshifted) lognormal final wealth. Therefore,
we will start with the comparison of the corner portfolios.

3.1 Comparison of Corner Portfolios 

Suppose, for instance, the von Neumann-Morgenstern utility function to be
logarithmic. Then the risk-free investment ($w = 0$) leads to final wealth
$v_0 e^{rT}$ and to expected utility

\[ \ln(v_0 e^{rT}) = \ln(v_0) + rT, \qquad (5) \]

whereas the risky investment leads to expected utility

\[ E[\ln(v_0 e^{R(T)})] = \ln(v_0) + E[R(T)] = \ln(v_0) + \left(\mu - \tfrac{\sigma^2}{2}\right) T. \qquad (6) \]

From (6) one can see that risk aversion (to be more precise: proportional risk
aversion equal to 1) with respect to final wealth stemming from investment
into the stock market is equivalent to risk neutrality with respect to the
rate of return. Furthermore, one can see that the comparison between the
corner portfolios boils down to the comparison of $r$ with $\mu - \sigma^2/2$. Obviously,
parameter values like those in (4) yield a preference for the risky investment.
In particular, the preference valid for $T = 1$ holds for any planning horizon.
The following theorem, which is due to Samuelson (1963), generalizes this
observation.

Theorem 1 (Samuelson) If

\[ u[v e^{r}] > E[u(v e^{R(1)})] \]

(risk-free alternative is preferred, given $T = 1$) is valid for any $v > 0$, then
it holds

\[ u[v e^{rT}] > E[u(v e^{R(T)})], \quad T = 2, 3, \ldots \]

The theorem states that preference reversal in favor of the risky alternative
is impossible if the riskless alternative is preferred on a one-period basis.
Samuelson expressed the theorem in terms of abstention (= risk-free
alternative) and repeated bets and the proof in terms of utility metric, money
metric etc. For the sake of reference the proof has been translated into
the notation of this paper (see Appendix 1). It shows that no assumptions
concerning the shape of the utility function or the distribution of the
one-period returns are required. Only the i.i.d. property is essential. This could
be interpreted as robustness of the theorem. The other side of the coin is
the restrictiveness of the if-clause. Setting $u(x) = \ln x$ we know from (5)
and (6) that the if-clause is only valid for $r > \mu - \sigma^2/2$, which is very unlikely
on empirical grounds. Thus the if-clause implicitly rules out certain utility
functions or imposes unrealistic constraints on parameter values. This
should be seen clearly and kept in mind to prevent overinterpretations. After
all, the theorem triggered the controversy about the theoretical validity
of the popular opinion. And indeed, every paper which aims at questioning
the theoretical underpinning refers to Samuelson's theorem; compare, for
instance, Bodie et al. (1989, p. 220-226), Kritzman (1994).

Up to now we used final wealth as argument of the utility function. If $v_0$
is not the true initial wealth but merely an amount which is at the
investor's disposal, then “final wealth” must not be taken literally; it is only a
certain increment which contributes to the true final wealth. It is a matter of
opinion which is the most legitimate argument of the utility function: final
wealth, increments, rate of return etc. Bamberg and Lasch (1997) considered
the average rate of return $R(T)/T$ instead of final wealth as driving force and
subject to expected utility, combined it with constant risk aversion $\rho > 0$
and derived the result:

If the planning horizon $T$ is below the threshold

\[ \bar{T} = \frac{\rho \sigma^2}{2 \left(\mu - \frac{\sigma^2}{2} - r\right)}, \]

then it is optimal to invest risk-free. For $T > \bar{T}$ it is optimal
to invest into the stock market. For $T = \bar{T}$ there is indifference
between the two corner portfolios.
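A threshold of this kind can be traced back to a direct comparison of the two corner portfolios. The following sketch assumes, as in Appendix 2, the exponential utility $-\exp(-\rho x)$ on the average rate of return $x$, and uses (3), by which $R(T)/T$ is normal with mean $\mu - \sigma^2/2$ and variance $\sigma^2/T$. The symbol $\bar{T}$ for the threshold is notation chosen here.

```latex
\begin{align*}
\text{stock market only:} \quad
  & E\bigl[-e^{-\rho R(T)/T}\bigr]
    = -\exp\Bigl(-\rho\bigl(\mu - \tfrac{\sigma^2}{2}\bigr)
      + \tfrac{\rho^2 \sigma^2}{2T}\Bigr), \\
\text{risk-free only:} \quad
  & -e^{-\rho r}, \\
\text{stock market preferred} \quad
  & \iff \mu - \tfrac{\sigma^2}{2} - \tfrac{\rho \sigma^2}{2T} > r
    \iff T > \bar{T} := \frac{\rho \sigma^2}{2\bigl(\mu - \tfrac{\sigma^2}{2} - r\bigr)}.
\end{align*}
```

With the standard numbers (4) the denominator is $2 \cdot 0.07 = 0.14$, so that $\bar{T} \approx 0.29\,\rho$ years under these assumptions.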

The result strongly supports the popular opinion and seems to contradict
Samuelson's theorem. However, a little algebra (Appendix 2) reveals that the
assumption of constant absolute risk aversion $\rho$ with respect to average rate
of return is equivalent to constant proportional risk aversion $1 + \rho/T$ with
respect to final wealth. Thus risk aversion with respect to final wealth is
(slightly) decreasing with the planning horizon $T$. Samuelson's theorem
excludes utility functions which depend on $T$.

Just the reverse is true if we start from utility functions used in Samuelson's
theorem (defined on final wealth and independent of $T$). These utility
functions imply risk aversion with respect to the average rate of return
$R(T)/T$ to be an increasing function of $T$, which is the actual reason that
prevents the preference reversal.

3.2 Optimal Fraction w(T) 

Now we will turn to the analysis of the optimal fraction $w(T) \in [0, 1]$.
Expected utility of final wealth,

\[ U(w, T) = E[u(v_0 w e^{R(T)} + v_0 (1 - w) e^{rT})], \qquad (7) \]

has to be maximized with respect to $w$. As usual, $u$ is assumed to be strictly
increasing and strictly concave, i.e.

\[ u' > 0 \text{ and } u'' < 0. \]



From the derivatives

\[ \frac{\partial U(w, T)}{\partial w} = v_0 E[u'(v_0 w e^{R(T)} + v_0 (1 - w) e^{rT}) (e^{R(T)} - e^{rT})] \]

\[ \frac{\partial U(w, T)}{\partial w} \Big|_{w=0} = v_0 u'(v_0 e^{rT}) (e^{\mu T} - e^{rT}) > 0 \text{ if } \mu > r \]

\[ \frac{\partial^2 U(w, T)}{\partial w^2} = v_0^2 E[u''(v_0 w e^{R(T)} + v_0 (1 - w) e^{rT}) (e^{R(T)} - e^{rT})^2] < 0 \]

one can draw the conclusions:

Conclusion 1: If the drift parameter $\mu$ exceeds the risk-free rate $r$ it can
never be optimal to invest all the money at the risk-free rate. This result
holds, no matter how big the investor's (finite) risk aversion is.

Conclusion 2: Expected utility $U(w, T)$ is a strictly concave function of $w$.
Especially, $w(T)$ is uniquely determined.

The further analysis will be based on the important and frequently used
HARA class of utility functions. HARA (= hyperbolic absolute risk aversion)
is defined as

\[ u(v) = \begin{cases} \dfrac{v^{1-\rho}}{1-\rho} & \rho \neq 1 \\[1ex] \ln v & \rho = 1 \end{cases} \]

where the positive parameter $\rho$ is proportional risk aversion and $v > 0$
denotes final wealth. Absolute risk aversion $\rho / v$ is nonconstant and
hyperbolically declining with wealth $v$.

The optimal fraction $w(T)$ is an interior solution iff $\mu > r$ (which excludes
$w(T) = 0$) and, in addition,

\[ \frac{\partial U(w, T)}{\partial w} \Big|_{w=1} < 0. \qquad (8) \]

Since HARA utility functions have the simple derivative $u'(v) = v^{-\rho}$,
inequality (8) can easily be evaluated. It turns out that (8) is valid iff

\[ \rho > \frac{\mu - r}{\sigma^2}. \qquad (9) \]

Our standard numbers (4) satisfy $\mu > r$ and yield

\[ \frac{\mu - r}{\sigma^2} = \frac{0.12 - 0.03}{0.04} = 2.25. \]

That means:

Case 1: If proportional risk aversion $\rho$ is greater than 2.25, then no corner
portfolio can be optimal; $w(T) \in (0, 1)$ for all $T > 0$.

Case 2: If $\rho$ is less than or equal to 2.25, then $w(T) = 1$ for all $T > 0$.



Figure 1: The optimal fraction $w(T)$ to invest into the stock market is an
increasing function of the planning horizon $T$

Case 2 applies, for instance, to the logarithmic utility function, where $\rho = 1$.
Unfortunately, it seems impossible to derive a closed form for $w(T)$ in case 1.
But numerical evaluations show that $w(T)$ is increasing. This gives support
to the popular opinion and is depicted in Fig. 1.
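A numerical evaluation of this kind can be sketched as follows. This is not the authors' procedure but one possible approach: expected utility is approximated by a midpoint rule over the standard normal variable driving $R(T)$, with $v_0 = 1$, and the strictly concave function of $w$ is maximized on $[0, 1]$ by ternary search. The grid size, the integration range, the tolerance, the value $\rho = 4$, and all function names are assumptions.

```python
import math

MU, SIGMA, R_FREE = 0.12, 0.20, 0.03  # standard numbers used in the text


def utility(v, rho):
    """HARA utility with proportional risk aversion rho (log if rho == 1)."""
    return math.log(v) if rho == 1.0 else v ** (1.0 - rho) / (1.0 - rho)


def expected_utility(w, horizon, rho, n_grid=2001, z_max=8.0):
    """Approximate E[u(V_T)] for v0 = 1 by a midpoint rule over z ~ N(0, 1)."""
    mean = (MU - 0.5 * SIGMA**2) * horizon
    std = SIGMA * math.sqrt(horizon)
    safe = (1.0 - w) * math.exp(R_FREE * horizon)
    step = 2.0 * z_max / n_grid
    total = 0.0
    for i in range(n_grid):
        z = -z_max + (i + 0.5) * step
        density = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        total += utility(w * math.exp(mean + std * z) + safe, rho) * density * step
    return total


def optimal_fraction(horizon, rho, tol=1e-4):
    """Maximize the concave function w -> U(w, T) on [0, 1] by ternary search."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if expected_utility(m1, horizon, rho) < expected_utility(m2, horizon, rho):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)


for horizon in (1.0, 5.0, 10.0, 20.0):
    print(horizon, round(optimal_fraction(horizon, rho=4.0), 3))
```

For $\rho = 1$ the search runs into the corner $w = 1$, in line with case 2; for $\rho = 4$ it returns an interior fraction, in line with case 1.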



4 Shortfall Models 

From time to time shortfall models (= downside risk models) enjoy a revival. 
These days they are very popular indeed, maybe stimulated by the value- 
at-risk discussion. For a comprehensive bibliography see for example the 
references given by Kaduff (1996). Advocates of shortfall models argue that 
many private and institutional investors aim at achieving a certain target 
level given in terms of wealth or rate of return. Therefore, the probability of 
not achieving this level in a predetermined time period is a matter of concern. 
Moreover, they argue investors are more likely to understand and to specify
the required ingredients $(T, r^*, \alpha)$ than a von Neumann-Morgenstern utility
function.

As indicated, we use the notation

$r^*$ = target or benchmark (rate of return)
$\alpha$ = (tolerated) shortfall probability $\in (0, 1)$.

To avoid annoying case distinctions we assume $r^* \neq r$.

Shortfall occurs if

\[ v_0 [w e^{R(T)} + (1 - w) e^{rT}] < v_0 e^{r^* T}. \qquad (10) \]

A shortfall probability equal to $\alpha$ is equivalent to

\[ P\left( R(T) < \ln \frac{e^{r^* T} - (1 - w) e^{rT}}{w} \right) = \alpha, \qquad (11) \]

provided that $w > 0$. The fraction $w(T)$ compatible with (11) is uniquely
determined and given by

\[ w(T) = \frac{\exp[(r^* - r) T] - 1}{\exp[z_\alpha \sigma \sqrt{T} + (\mu - \frac{\sigma^2}{2} - r) T] - 1}, \qquad (12) \]

where $z_\alpha$ is the $\alpha$-fractile of a standard normal distribution. If fractions
$w(T) > 1$ are admitted we have to adopt the assumption

borrowing rate = lending rate = $r$.
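The closed form for the fraction can be evaluated directly and checked against the shortfall probability it is supposed to produce. The sketch below uses the standard numbers (4); the target rate of 0% (capital preservation), the level $\alpha = 2.5\%$, and the function names are assumptions made for illustration.

```python
import math
from statistics import NormalDist

MU, SIGMA, R_FREE = 0.12, 0.20, 0.03  # standard numbers used in the text


def shortfall_fraction(horizon, target_rate, alpha):
    """Fraction w(T) whose shortfall probability equals alpha."""
    z_alpha = NormalDist().inv_cdf(alpha)
    numerator = math.exp((target_rate - R_FREE) * horizon) - 1.0
    exponent = (z_alpha * SIGMA * math.sqrt(horizon)
                + (MU - 0.5 * SIGMA**2 - R_FREE) * horizon)
    return numerator / (math.exp(exponent) - 1.0)


def shortfall_probability(w, horizon, target_rate):
    """Probability that final wealth falls short of the target, for w > 0."""
    threshold = math.log((math.exp(target_rate * horizon)
                          - (1.0 - w) * math.exp(R_FREE * horizon)) / w)
    r_dist = NormalDist((MU - 0.5 * SIGMA**2) * horizon,
                        SIGMA * math.sqrt(horizon))
    return r_dist.cdf(threshold)


# Hypothetical investor: target rate 0%, tolerated shortfall probability 2.5%.
w_1 = shortfall_fraction(horizon=1.0, target_rate=0.0, alpha=0.025)
w_5 = shortfall_fraction(horizon=5.0, target_rate=0.0, alpha=0.025)
check = shortfall_probability(w_1, horizon=1.0, target_rate=0.0)
```

In this example the admissible fraction is larger for the five-year horizon than for the one-year horizon, and `check` reproduces the prescribed shortfall probability.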

Formula (12) has been studied comprehensively by Bamberg and Lasch
(1997); the analogue in terms of discrete (instead of continuously
compounded) rates of return has been discussed by Zimmermann (1992). Here
we will add some supplements. First, we distinguish the two cases $r^* > r$
and $r^* < r$.

Case 1: (High aspiration level): $r^* > r$

Obviously, it is impossible to have a high aspiration level (of, say, double
the risk-free rate) and at the same time to stipulate low risk $\alpha$ and a short
planning horizon $T$. Graphically the situation is represented by Fig. 2.



Figure 2: Low shortfall probability $\alpha$ and short planning horizon $T$ are not
compatible with a high aspiration level $r^* > r$




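Feasible settings are determined by $w(T) > 0$, i.e.

\[ z_\alpha \sigma \sqrt{T} + \left(\mu - \tfrac{\sigma^2}{2} - r\right) T > 0, \]

\[ z_\alpha > - \frac{(\mu - \frac{\sigma^2}{2} - r) \sqrt{T}}{\sigma}, \]

\[ \alpha > \Phi\left( - \frac{(\mu - \frac{\sigma^2}{2} - r) \sqrt{T}}{\sigma} \right), \]

where $\Phi$ denotes the distribution function of $N(0, 1)$. Hence the borderline
between the two regions in Fig. 2 is

\[ \alpha = \Phi\left( - \frac{(\mu - \frac{\sigma^2}{2} - r) \sqrt{T}}{\sigma} \right). \qquad (13) \]

In particular, $\alpha = 50\%$ may be combined with any planning horizon $T$. The
typical behavior of $w(T)$ is depicted in Fig. 3.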
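The borderline can be evaluated in both directions: as the shortfall probability belonging to a given horizon, and as the horizon at which a given level $\alpha < 0.5$ meets the borderline. The sketch below uses the standard numbers (4); the function names are hypothetical.

```python
import math
from statistics import NormalDist

MU, SIGMA, R_FREE = 0.12, 0.20, 0.03  # standard numbers used in the text
EXCESS = MU - 0.5 * SIGMA**2 - R_FREE  # mu - sigma^2/2 - r


def borderline_alpha(horizon):
    """Shortfall probability on the borderline for a given planning horizon."""
    return NormalDist().cdf(-EXCESS * math.sqrt(horizon) / SIGMA)


def borderline_horizon(alpha):
    """Planning horizon at which the level alpha < 0.5 meets the borderline."""
    z_alpha = NormalDist().inv_cdf(alpha)
    return (z_alpha * SIGMA / EXCESS) ** 2


alpha_1y = borderline_alpha(1.0)
t_alpha = borderline_horizon(0.025)
```

With these numbers a one-year horizon corresponds to a borderline probability of about 36%, while the level $\alpha = 2.5\%$ meets the borderline only after roughly 31 years.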



Figure 3: The optimal fraction $w(T)$ increases with $T$ if the target rate $r^*$ is
very high. If the target rate is fixed within the interval $(r, \mu - \sigma^2/2)$, one can
afford to reduce $w(T)$ for longer time horizons $T$




Case 2: (Low aspiration level): $r^* < r$

The situation shown in Fig. 2 is now just reversed. One has to interchange
the terms “feasible” and “unfeasible”; the borderline remains the same. For
every $\alpha < 0.5$ we get a maximum planning horizon $T_\alpha$ (intersection of the
level $\alpha$ with the borderline) such that

$T < T_\alpha$ : optimal portfolio has shortfall probability equal to $\alpha$
$T > T_\alpha$ : optimal portfolio has shortfall probability lower than $\alpha$.

Taking the shortfall criterion literally we have to restrict ourselves to $T < T_\alpha$ to get
graphs like those in Fig. 4 which are strongly consistent with our popular
opinion.







Figure 4: Graphs of the optimal fraction $w(T)$, given the shortfall probability
$\alpha = 2.5\%$ and $\mu$, $\sigma$, $r$ as in (4)




Case 2 looks rather unnatural in the light of the fact that planning horizons
$T$ beyond $T_\alpha$ lead to portfolios with shortfall probability smaller than the
prescribed probability $\alpha$. Risk averters could argue: Why take unnecessary
risk at all? This objection is taken up in the well-known criteria of Roy,
Telser and Kataoka. In our framework we get the following results:

• Roy (1952) proposed minimizing the shortfall probability. The resulting
safety-first portfolios are

If $r^* < r$ : invest only risk-free, i.e. $w(T) = 0$
If $r^* > r$ : invest into shares as much as you can, i.e. $w(T) = \bar{w}$,
where $w \le \bar{w}$ is the investor's borrowing constraint

The latter assertion stems from the fact that the left-hand side of (11)
is a declining function of $w$.

• Telser (1955) proposed maximizing expected final wealth $E(V_T)$ subject
to the constraint: Shortfall probability $\le \alpha$.

The criterion primarily aims at justifying case 2. It quantifies the
chances through $E(V_T)$ and compares them with the risk taken. Therefore
we will restrict ourselves to case 2, i.e. $r^* < r$. Since

\[ E(V_T) = v_0 \cdot w (e^{\mu T} - e^{rT}) + v_0 e^{rT} \]

is an increasing function of the fraction $w$ and $w$ is an increasing
function of $\alpha$ we end up with the advice: The admitted shortfall probability
should be exhausted to the full. In this sense case 2 can be justified.

• Kataoka (1963) proposed starting with an upper bound $\alpha \in (0, 1)$ for
the shortfall probability and looking for portfolios which maximize the
(target) rate $r^*$.

From (11) (interpreted as inequality) we obtain

\[ \ln \frac{e^{r^* T} - (1 - w) e^{rT}}{w} \le u_\alpha, \]

where $u_\alpha$ is the $\alpha$-fractile of $R(T)$. The inequality can be transformed
into

\[ r^* \le \frac{1}{T} \ln\left[ e^{rT} + w \left( e^{(\mu - \frac{\sigma^2}{2}) T + \sigma \sqrt{T} z_\alpha} - e^{rT} \right) \right]. \qquad (14) \]

Again we need a borrowing constraint $w \le \bar{w}$ in order to get a finite
solution. If the factor of $w$ in (14) turns out to be positive, the right-hand
side is maximal if $w = \bar{w}$; otherwise it will be maximal if $w = 0$.
Hence Kataoka gives the advice:

If $\alpha > \Phi\left( - \frac{(\mu - \frac{\sigma^2}{2} - r) \sqrt{T}}{\sigma} \right)$ : invest as much as you can into shares ($w(T) = \bar{w}$)

If $\alpha < \Phi\left( - \frac{(\mu - \frac{\sigma^2}{2} - r) \sqrt{T}}{\sigma} \right)$ : invest only risk-free ($w(T) = 0$)

Graphically, the situation is like in Fig. 2 if “feasible” is replaced by
$w(T) = \bar{w}$ and “unfeasible” is replaced by $w(T) = 0$; the same borderline
applies. The maximal rate $r^*$ is given by the right-hand side of
(14) with $w = 0$ or $w = \bar{w}$ inserted.
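Kataoka's rule can be sketched in a few lines. The function below decides between the two corner solutions according to the sign of the factor of $w$ and returns the resulting maximal target rate. It uses the standard numbers (4); the borrowing limit `w_max`, the example levels of $\alpha$, and the function name are assumptions.

```python
import math
from statistics import NormalDist

MU, SIGMA, R_FREE = 0.12, 0.20, 0.03  # standard numbers used in the text


def kataoka(horizon, alpha, w_max):
    """Return (w, r_star): chosen fraction and maximal target rate."""
    z_alpha = NormalDist().inv_cdf(alpha)
    # alpha-fractile of R(T)
    u_alpha = ((MU - 0.5 * SIGMA**2) * horizon
               + SIGMA * math.sqrt(horizon) * z_alpha)
    # factor of w: positive means that more shares raise the attainable rate
    factor = math.exp(u_alpha) - math.exp(R_FREE * horizon)
    w = w_max if factor > 0.0 else 0.0
    r_star = math.log(math.exp(R_FREE * horizon) + w * factor) / horizon
    return w, r_star


cautious = kataoka(horizon=1.0, alpha=0.025, w_max=1.0)
tolerant = kataoka(horizon=1.0, alpha=0.5, w_max=1.0)
```

For a one-year horizon the cautious investor ($\alpha = 2.5\%$) stays risk-free and attains the rate $r$, whereas the tolerant investor ($\alpha = 50\%$) goes fully into shares and attains the median rate $\mu - \sigma^2/2$.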



5 Concluding Remarks 

Models based on expected utility or on a downside risk measure have been
presented. Is it possible to implement such models as a decision-support
tool? In principle it is possible, but problems arise in assessing the
required data: the utility function $u$ or the triple $(r^*, \alpha, T)$. Downside-risk
parameters of higher order have not been considered since they would
aggravate the assessment problem even more. Maybe methods of classification
could be useful to tackle the assessment problem.

As to the popular opinion more pros than cons have been presented. On an
after-tax basis the popular opinion gets an additional comparative advantage
(at least according to German tax laws). If we allow a kind of mean reversion
instead of our stock price dynamics, further support can be found in favor
of the popular opinion (see, e.g. Samuelson (1989, 1991), Kritzman (1994)).
On the whole, theory seems to provide more evidence in favor of the popular
opinion than against it.



Appendix 1: Proof of Samuelson’s Theorem 



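Let $R_1, R_2, \ldots, R_T$ be the i.i.d. rates of return for periods $1, 2, \ldots, T$. Then
we have

\[ R(1) = R_1, \qquad R(t) = \sum_{i=1}^{t} R_i \quad \text{(convolution of $t$ rates of return).} \]

The proof works by induction over $T$. We start by recalling the if-clause:

\[ u[v e^{r}] > E(u[v e^{R(1)}]) \quad \text{for all } v > 0. \]

From the assumption that

\[ u[v e^{rt}] > E(u[v e^{R(t)}]) \quad \text{for all } v > 0 \]

be valid for a specific $t$ (induction hypothesis) one concludes:

\begin{align*}
E(u[v e^{R(t+1)}])
  &= E(u[v e^{R_{t+1}} e^{R(t)}])
  && \text{(property of convolutions)} \\
  &= E_{R_{t+1}}\bigl[ E(u[v e^{x} e^{R(t)}] \mid R_{t+1} = x) \bigr]
  && \text{(law of iterated expectations)} \\
  &< E_{R_{t+1}}\bigl[ u(v e^{x} e^{rt}) \mid R_{t+1} = x \bigr] = E(u[v e^{R_{t+1}} e^{rt}])
  && \text{(induction hypothesis applied on $v e^{x}$ instead of $v$)} \\
  &< u[v e^{rt} \cdot e^{r}] = u[v e^{r(t+1)}]
  && \text{($R_{t+1}$ has the same distribution as $R(1)$, if-clause applied on $v e^{rt}$)}
\end{align*}

This completes the proof.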



Appendix 2: Conversion of Utility Functions 

Final wealth $V$ and average rate of return $X = R(T)/T$ are related through

\[ V = v_0 e^{XT}, \qquad X = \frac{1}{T} \ln\left(\frac{V}{v_0}\right). \]

The utility function $\tilde{u}(x)$ defined on realized average rates of return $x$ and
the utility function $u(v)$ defined on realized final wealth $v$ lead to the same
ordering of risky alternatives if and only if there exist $\alpha \in \mathbb{R}$ and $\beta > 0$ such
that

\[ u(v) = \alpha + \beta \tilde{u}(x(v)), \]

where $x(v) = \frac{1}{T} \ln\left(\frac{v}{v_0}\right)$.

Especially if constant absolute risk aversion $\rho > 0$ with respect to average
rates of return is assumed, i.e.

\[ \tilde{u}(x) = -\exp(-\rho x), \]

we get

\[ u(v) = \alpha - \beta \exp\left(-\frac{\rho}{T} \ln\left(\frac{v}{v_0}\right)\right) = \alpha - \beta \left(\frac{v}{v_0}\right)^{-\rho/T}, \]

which is equivalent to

\[ u(v) = -v^{-\rho/T} = -v^{1 - (1 + \rho/T)}. \]

The last expression reveals that $u$ has constant proportional risk aversion
equal to $1 + \rho/T$.
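As a quick check of the last claim, the proportional risk aversion $-v\,u''(v)/u'(v)$ can be computed directly for $u(v) = -v^{-\rho/T}$, where $\rho$ denotes the constant absolute risk aversion with respect to the average rate of return:

```latex
\begin{align*}
u'(v)  &= \frac{\rho}{T}\, v^{-\rho/T - 1}, \\
u''(v) &= -\frac{\rho}{T}\left(\frac{\rho}{T} + 1\right) v^{-\rho/T - 2}, \\
-\frac{v\,u''(v)}{u'(v)} &= 1 + \frac{\rho}{T}.
\end{align*}
```

The result does not depend on $v$, and it decreases towards 1 (the logarithmic case) as the planning horizon $T$ grows.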
Abstract: The paper presents a short overview of general ideas in the analysis
and management of financial risk. The emphasis is put on the application of
quantitative methods. First of all, some historical remarks are given. Then the
different concepts of understanding risk are discussed. A review of commonly used
statistical measures, where risk is understood as volatility and as sensitivity, is
given. Finally, the general method of risk management is presented. This method
aims at keeping the sensitivity of a portfolio at a desired level.



1 Some Historical Remarks on the Development of Statistical Methods in Finance

Statistical approaches in financial risk analysis are strictly related to the
development of modern finance theory. It should be mentioned that
statisticians, econometricians and mathematicians contributed significantly to
the emergence of modern finance theory, which is based on quantitative
approaches. There are two streams in the development of statistical methods
in finance.

The first stream contains the methods derived from the general notion of
return. The main driving force of the development of this stream was the idea
to derive methods of forecasting financial prices which, applied to financial
markets, could result in extraordinary returns. Among many different groups
of these methods we may mention: time series analysis methods based on
stochastic processes theory, including the well-known family of ARIMA-type
methods and ARCH-type methods (see e.g. Mills (1993)), and the methods
based on deterministic chaos theory (see e.g. Peters (1991)). The latter
group gained a lot of attention in the last several years, since chaotic time series,
generated by a deterministic recursive function (which depends strongly on
initial conditions), may look similar to and may have the same statistical
properties as time series being the realizations of stochastic processes.

The origin of this stream can be traced back to the famous work of Louis
Bachelier (Bachelier (1900)), who worked on models of financial prices and
derived the Brownian motion process.

The second stream contains the methods derived from the general notion
of risk. This paper refers to these methods. The main driving force of
the development of this stream was the idea to derive methods to hedge
(to protect) against the increasing risk on the financial market. This group
contains, first of all, two methods considered as landmarks in finance
theory, namely: portfolio theory (Markowitz (1952), Markowitz (1959)) and
option pricing theory (Black and Scholes (1973), Merton (1973)). These
two approaches are based on a roughly similar idea of the reduction of risk.
The authors (Markowitz in 1990, Scholes and Merton in 1997) were awarded
the Nobel Prize in economic sciences.



2 Understanding of Risk 



The concept of risk is a very general one, thus there is a lot of misunder- 
standing connected with this concept and many different ways to approach 
the problem of financial risk. We are going to discuss three basic problems 
related to understanding financial risk. 

The first one is the problem of defining financial risk by the people who 
analyze and manage this risk. Here two basic meanings of risk can be found: 

• downside risk - risk is understood as danger of loss (threat); 

• two-sided risk - risk is understood as deviation from expectations, 
therefore it is on one hand threat, on the other hand opportunity. 

These two meanings are reflected in different statistical measures of risk. 
The second problem is related to the question what is taken into account 
when risk is measured and the decision is being taken. There are two main 
components: 

• objective risk, where the uncertainty of financial market is reflected, 
here statistical measures of risk are used (discussed later in this paper); 

• subjective risk, where the attitude towards risk of particular person is 
reflected. 



It is worth mentioning that subjective risk involves psychological issues and
results from the behavioral theory of finance. Until now, the most common approach is
utility theory (von Neumann and Morgenstern (1944)), where a utility
function is used to measure the psychological satisfaction of a person who
has a particular level of wealth. It is claimed that people make investment
decisions on the financial market in such a way that expected utility is
maximized. It can also be proved that the particular form of the utility function reflects
the attitude towards risk; for example, a concave utility function characterizes a
risk-averse person.

These ideas were used to propose the following measures of risk aversion: 



• absolute risk aversion (Pratt (1964), Arrow (1965)):

$$AR(w) = -\frac{u''(w)}{u'(w)} \qquad (1)$$

• relative risk aversion (Pratt (1964), Arrow (1965)):

$$RR(w) = -\frac{w\,u''(w)}{u'(w)} \qquad (2)$$




• partial risk aversion (Zeckhauser and Keeler (1970), Menezes and Hanson (1970)):

$$PR(w) = -\left(1 - \frac{w_l}{w}\right)\frac{w\,u''(w)}{u'(w)} \qquad (3)$$



where:

$AR(w)$ - absolute risk aversion;
$RR(w)$ - relative risk aversion;
$PR(w)$ - partial risk aversion;
$w$ - level of wealth;
$w_l$ - level of wealth which will be retained with certainty (risk-free part of wealth);
$u(w)$ - utility for a given level of wealth;
$u'(w)$ - first derivative of utility function for a given level of wealth;
$u''(w)$ - second derivative of utility function for a given level of wealth.

The following interpretation can be given for these measures: 

1. Increasing absolute risk aversion means that as the investor's wealth
increases, the amount of funds spent on risky investments decreases.

2. Increasing relative risk aversion means that as the investor's wealth
increases, the fraction of funds spent on risky investments decreases.

3. Increasing partial risk aversion means that as the investor's wealth
increases and only part of this wealth is subject to risk, the fraction of
funds spent on risky investments decreases.
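
A minimal numerical sketch of the absolute and relative risk aversion measures may help to fix ideas. The two utility functions used here, the square-root utility and an exponential utility with the hypothetical parameter `A`, are assumptions chosen only for illustration; their derivatives are supplied by hand.

```python
import math


def absolute_risk_aversion(u1, u2, w):
    """AR(w) = -u''(w) / u'(w); u1 and u2 are the first and second derivatives."""
    return -u2(w) / u1(w)


def relative_risk_aversion(u1, u2, w):
    """RR(w) = -w * u''(w) / u'(w)."""
    return w * absolute_risk_aversion(u1, u2, w)


# Illustrative utility 1 (an assumption): u(w) = sqrt(w)
def sqrt_u1(w):
    return 0.5 * w ** -0.5


def sqrt_u2(w):
    return -0.25 * w ** -1.5


# Illustrative utility 2 (an assumption): u(w) = -exp(-A * w)
A = 0.1


def exp_u1(w):
    return A * math.exp(-A * w)


def exp_u2(w):
    return -A * A * math.exp(-A * w)


if __name__ == "__main__":
    for w in (1.0, 4.0, 16.0):
        print(
            w,
            absolute_risk_aversion(sqrt_u1, sqrt_u2, w),
            relative_risk_aversion(sqrt_u1, sqrt_u2, w),
            absolute_risk_aversion(exp_u1, exp_u2, w),
            relative_risk_aversion(exp_u1, exp_u2, w),
        )
```

Under these assumptions the square-root utility shows decreasing absolute and constant relative risk aversion, while the exponential utility shows constant absolute and increasing relative risk aversion.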

Utility theory is the most common approach to analyzing the subjective aspects
of risk. However, it has some drawbacks: it has been shown that some of
its axioms are not justified in some practical situations. Therefore other
approaches have been proposed. The most significant one is prospect theory
(Kahneman and Tversky (1979)), where one of the conclusions is that rather
than being risk-averse, investors are loss-averse.

The third problem related to the understanding of risk is that there
are different types of financial risk and each of them is analyzed in a
different way. From the point of view of statistical methods, two types of
risk are of particular interest:

• market risk - risk resulting from changes of financial market
prices (e.g. interest rates, exchange rates, stock prices, commodity
prices); it can be understood as downside risk or two-sided risk;

• credit risk - risk resulting from the fact that a counterparty to
a contract traded on the market can default; this type of risk is
understood as downside risk.

There are, however, other types of risk that are more difficult to quantify,
such as legal risk, political risk or operational risk. Since these types of
risk are understood as categories, they can be analyzed by introducing
qualitative (discrete) variables.



3 Statistical Measures of Risk 

There are many statistical measures of risk used in finance, particularly in
financial applications. Most of them consider only the objective side of risk,
reflecting the uncertainty of the market. These measures can be classified
into three groups. The first one contains categorical measures, where the
level of risk depends on the probability of default. They are used to analyze
credit risk. The two other groups of measures are used to analyze market risk.
In the second group market risk is understood as volatility (variability) of
returns (or prices). In these measures the distribution of returns (or prices)
is taken into account and the measure of risk is a characteristic of this
distribution. The most commonly used are:

• standard deviation (or variance);

• semi-standard deviation (or semi-variance), defined as:

$$S_s = \sqrt{E\left[\left((R - E(R))^{-}\right)^{2}\right]} \qquad (4)$$

where:

$R$ - return;
$S_s$ - semi-standard deviation;
$E(R)$ - expected value of returns;
$(R - E(R))^{-}$ - the difference from the expected return if this difference
is negative, and 0 if this difference is non-negative;

• safety level, defined as:

$$P(R \le R_s) = \alpha \qquad (5)$$

where:

$R_s$ - safety level;
$P$ - probability;
$\alpha$ - value of probability, taken as a small number;

• probability of not achieving required return, defined as:

$$P(R \le R_R) = \alpha \qquad (6)$$

where:

$R_R$ - required return, assumed by investor.

Of course, there are many other possibilities, since many spread measures
can be used (including robust measures). It is worth mentioning that recently
the so-called Value at Risk (VaR) has been recommended to financial
institutions as a very adequate measure of downside risk (see e.g. Jorion
(1997)). It can be proved that Value at Risk is a function of the safety level
(it can be interpreted as the difference between the value of the investment
and the safety level).
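
The volatility-type measures listed above can be sketched for a sample of observed returns as follows. This is only an illustration under stated assumptions: the sample `RETURNS` is hypothetical, population (not sample) moments are used, the safety level is taken as an empirical quantile, and Value at Risk is read as the investment's value minus its value at the safety level.

```python
import math

# Hypothetical sample of ten periodic returns (an assumption for illustration).
RETURNS = [-0.10, -0.04, -0.02, 0.00, 0.01, 0.02, 0.03, 0.04, 0.06, 0.10]


def mean(values):
    return sum(values) / len(values)


def std_dev(returns):
    """Standard deviation: deviations in both directions count as risk."""
    m = mean(returns)
    return math.sqrt(mean([(r - m) ** 2 for r in returns]))


def semi_std_dev(returns):
    """Semi-standard deviation: only deviations below the mean count as risk."""
    m = mean(returns)
    return math.sqrt(mean([min(r - m, 0.0) ** 2 for r in returns]))


def safety_level(returns, alpha):
    """Smallest observed return whose empirical distribution function reaches alpha."""
    ordered = sorted(returns)
    k = max(1, math.ceil(alpha * len(ordered)))
    return ordered[k - 1]


def shortfall_probability(returns, required_return):
    """Empirical probability of not exceeding the required return."""
    return sum(1 for r in returns if r <= required_return) / len(returns)


def value_at_risk(initial_value, returns, alpha):
    """Value of the investment minus its value at the safety level (assumed reading)."""
    return initial_value - initial_value * (1.0 + safety_level(returns, alpha))


if __name__ == "__main__":
    print(std_dev(RETURNS), semi_std_dev(RETURNS))
    print(safety_level(RETURNS, 0.25), shortfall_probability(RETURNS, 0.0))
    print(value_at_risk(1000.0, RETURNS, 0.25))
```

The semi-standard deviation can never exceed the standard deviation, since it drops the contribution of returns above the mean.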

The objective and subjective components of risk can be combined in the 
concept of generalized risk (Rothschild and Stiglitz (1970)). It can be 
defined in the following way: 

Investment A is less risky than investment B, if and only if

$$F_A(R) \le F_B(R) \qquad (7)$$

for each value of return, and this inequality is strict for at least one value
of return. Here $F_A(R)$ is the cumulative distribution of returns for
investment A (similarly for B).

This concept combines the objective and subjective risk, since the following 
interpretation can be given: 

If two investments have the same expected return and differ with respect to 
risk, investment A is less risky if it is chosen by all investors characterized 
by concave utility function, that is by all risk-averse investors. 

The following example gives the illustration of differences between general- 
ized risk and standard deviation of returns, the latter taking into account 
only objective side of risk. 

Example.

Given two investments, described by the distributions of final wealth:
Investment A: 0 with probability 0.5; 4 with probability 0.5;
Investment B: 1 with probability 0.875; 9 with probability 0.125.

The expected return of both investments is equal to 2. The standard deviation
of returns of A is equal to 2.00, the standard deviation of returns of B is
equal to 2.65. Therefore, if risk is measured by standard deviation (or
variance), investment A is less risky and should be preferred.

Now we take expected utility maximization as a criterion of investment
choice and we consider two concave utility functions (characterizing two
risk-averse investors):

1. For the utility function

$$u = \sqrt{w}$$

we get the expected utilities:

$$E(u_A) = 1, \quad E(u_B) = 1.25.$$

Thus investment B should be preferred by this risk-averse investor.

2. For the utility function

$$u = -w^2 + 20w$$

we get the expected utilities:

$$E(u_A) = 32, \quad E(u_B) = 29.$$

Thus investment A should be preferred by this risk-averse investor.

As can be seen, different concave utility functions result in different
decisions. This is so because here generalized risk does not lead to a unique
solution: the inequality given in its definition is not satisfied (the
cumulative distribution functions of the two investments intersect).
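
The figures in this example can be recomputed with a short script. It is a sketch only: the two distributions and the two utility functions are taken from the example, while the helper names are introduced here for illustration.

```python
import math

# Distributions of final wealth as (value, probability) pairs, from the example.
INVESTMENT_A = [(0.0, 0.5), (4.0, 0.5)]
INVESTMENT_B = [(1.0, 0.875), (9.0, 0.125)]


def expectation(dist, f=None):
    """Expected value of f(wealth); f defaults to the identity."""
    return sum(p * (w if f is None else f(w)) for w, p in dist)


def std_dev(dist):
    m = expectation(dist)
    return math.sqrt(expectation(dist, lambda w: (w - m) ** 2))


def cdf(dist, x):
    """Cumulative distribution function evaluated at x."""
    return sum(p for w, p in dist if w <= x)


def u_sqrt(w):
    return math.sqrt(w)


def u_quadratic(w):
    return -w ** 2 + 20 * w


if __name__ == "__main__":
    for name, dist in (("A", INVESTMENT_A), ("B", INVESTMENT_B)):
        print(name, expectation(dist), round(std_dev(dist), 2),
              expectation(dist, u_sqrt), expectation(dist, u_quadratic))
    # The distribution functions cross, so condition (7) fails in both directions.
    print(cdf(INVESTMENT_A, 0.0) > cdf(INVESTMENT_B, 0.0),
          cdf(INVESTMENT_A, 1.0) < cdf(INVESTMENT_B, 1.0))
```

The last line checks the crossing of the two distribution functions at the wealth levels 0 and 1, which is why neither investment dominates the other in the sense of generalized risk.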

In the third group market risk is understood as sensitivity of return (or 
price) with respect to the changes of some underlying factors influencing re- 
turns (or prices). To apply this approach, a model of dependence of return 
(or price) on underlying factors is determined. The most commonly used 
sensitivity measures of financial risk are: 

• beta (sensitivity of return on stock with respect to the changes of stock
market index), given as:

$$R = \alpha + \beta R_M + \varepsilon \qquad (8)$$

where:

$\beta$ - beta coefficient (sensitivity measure);
$R$ - return on stock;
$R_M$ - return on stock market index;
$\varepsilon$ - random component;

• sensitivities of returns with respect to the changes in risk factors in
Arbitrage Pricing Theory (see Ross (1976)), given as:

$$R = a + b_1 F_1 + b_2 F_2 + \ldots + b_m F_m + \varepsilon \qquad (9)$$

where:

$b_j$ - sensitivity coefficient with respect to the j-th risk factor;
$F_j$ - j-th risk factor;

• delta and gamma coefficients (sensitivities of option value with respect
to the changes of price of underlying instrument), given as derivatives:

$$\delta = \frac{\partial C}{\partial P} \qquad (10)$$

$$\gamma = \frac{\partial^2 C}{\partial P^2} \qquad (11)$$

where:

$\delta$ - delta coefficient;
$\gamma$ - gamma coefficient;
$C$ - price of an option;
$P$ - price of underlying instrument (e.g. stock);
• duration and convexity (sensitivities of bond value with respect to the
changes of interest rates), given as functions of derivatives:

$$D = -\frac{1}{P}\frac{dP}{dr} \qquad (12)$$

$$C = \frac{1}{P}\frac{d^2 P}{dr^2} \qquad (13)$$

where:

$D$ - duration of a bond;
$C$ - convexity of a bond;
$P$ - price of a bond;
$r$ - interest rate.
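
Two of these sensitivity measures can be sketched numerically. The beta coefficient is estimated here by ordinary least squares, and duration and convexity are approximated by central finite differences. The conventions $D = -(1/P)\,dP/dr$ and $C = (1/P)\,d^2P/dr^2$, the annually compounded bond pricing function and all numbers are assumptions made for this illustration.

```python
def beta_coefficient(stock_returns, market_returns):
    """Least-squares slope of stock returns on market index returns."""
    n = len(stock_returns)
    mean_s = sum(stock_returns) / n
    mean_m = sum(market_returns) / n
    cov = sum((s - mean_s) * (m - mean_m)
              for s, m in zip(stock_returns, market_returns))
    var = sum((m - mean_m) ** 2 for m in market_returns)
    return cov / var


def bond_price(rate, coupon, face, years):
    """Present value of annual coupons plus face value (annual compounding)."""
    coupons = sum(coupon / (1.0 + rate) ** t for t in range(1, years + 1))
    return coupons + face / (1.0 + rate) ** years


def duration_and_convexity(price_fn, rate, h=1e-4):
    """Finite-difference estimates of -(1/P) dP/dr and (1/P) d2P/dr2."""
    p0 = price_fn(rate)
    up, down = price_fn(rate + h), price_fn(rate - h)
    first = (up - down) / (2.0 * h)
    second = (up - 2.0 * p0 + down) / h ** 2
    return -first / p0, second / p0


if __name__ == "__main__":
    market = [0.01, 0.02, -0.01, 0.03]
    stock = [0.001 + 1.5 * m for m in market]  # hypothetical stock with beta 1.5
    print(beta_coefficient(stock, market))
    print(duration_and_convexity(lambda r: bond_price(r, 0.0, 100.0, 5), 0.05))
```

For a zero-coupon bond with maturity T the finite-difference results can be compared with the closed forms T/(1+r) and T(T+1)/(1+r)^2 that follow from the assumed conventions.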



4 Risk Management 

Risk management is one of the most important activities on the financial
market. There are several approaches to risk management, and some of them use
statistical methods. The oldest one, where risk is measured as volatility
(standard deviation is applied as a measure of risk), is the already mentioned
portfolio theory. Recently, another approach, where risk is measured via
sensitivity measures, has often been adopted. We will give a general
description of this approach.

First of all, a so-called risk profile of the investment is derived, given as:

$$P = f(X_1, X_2, \ldots, X_n, V)$$

where:

$P$ - value of investment;
$X_i$ - value of the i-th risk factor;
$V$ - so-called volatility matrix, being a covariance matrix of all risk factors.

If the volatility matrix is treated as constant and small changes of the
underlying factors are considered, we have the following equation reflecting
the sensitivity of value with respect to the changes of factors:

$$f(X_1 + \Delta X_1, X_2 + \Delta X_2, \ldots, X_n + \Delta X_n, V) \approx f(X_1, X_2, \ldots, X_n, V) + \sum_{i=1}^{n} F_i \Delta X_i + \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} F_{ij} \Delta X_i \Delta X_j$$

where:

$\Delta X_i$ - change of the i-th risk factor;
$F_i$ - first derivative of value with respect to the i-th risk factor;
$F_{ij}$ - second derivative of value with respect to the i-th and j-th risk factors.
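
The quality of such a second-order approximation can be checked on a toy risk profile. The sketch below assumes a hypothetical value function of two risk factors, $P = X_1^2 X_2$, with derivatives written out by hand, and uses the usual second-order Taylor expansion including the factor 1/2 on the quadratic term.

```python
def value(x):
    """Hypothetical risk profile of two risk factors: P = X1^2 * X2."""
    return x[0] ** 2 * x[1]


def second_order_change(first, second, dx):
    """Sum of F_i * dX_i plus one half of the sum of F_ij * dX_i * dX_j."""
    n = len(dx)
    linear = sum(first[i] * dx[i] for i in range(n))
    quadratic = 0.5 * sum(second[i][j] * dx[i] * dx[j]
                          for i in range(n) for j in range(n))
    return linear + quadratic


X = [2.0, 3.0]
# First and second derivatives of the hypothetical profile at X.
FIRST = [2.0 * X[0] * X[1], X[0] ** 2]
SECOND = [[2.0 * X[1], 2.0 * X[0]],
          [2.0 * X[0], 0.0]]
DX = [0.01, -0.02]

exact_change = value([X[0] + DX[0], X[1] + DX[1]]) - value(X)
approx_change = second_order_change(FIRST, SECOND, DX)

if __name__ == "__main__":
    print(exact_change, approx_change)
```

For small changes of the factors the remaining error is of third order in the changes, which is why the two printed numbers agree to several decimal places.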

In general, the volatility matrix changes over time, so the following
characteristics should also be taken into account:

• delta coefficient, given as:

$$\delta_i = \frac{\partial P}{\partial X_i} \qquad (14)$$

• gamma coefficient, given as:

$$\gamma_{ij} = \frac{\partial^2 P}{\partial X_i \partial X_j} \qquad (15)$$

• vega coefficient, given as a derivative with respect to a particular
element $v_{ij}$ of the volatility matrix:

$$\nu_{ij} = \frac{\partial P}{\partial v_{ij}} \qquad (16)$$



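The risk management problem is the task of building a portfolio of financial
instruments in such a way that the sensitivities of the value of this
portfolio are kept at desired levels. This leads to the following
optimization problem:

Minimize

$$\sum_{i=1}^{n} \alpha_{1i}\,(\delta_{pi} - \delta_{0i})^2 + \sum_{i=1}^{n}\sum_{j=1}^{n} \left[\alpha_{2ij}\,(\gamma_{pij} - \gamma_{0ij})^2 + \alpha_{3ij}\,(\nu_{pij} - \nu_{0ij})^2\right]$$

where:

$$\delta_{pi} = \sum_{k=1}^{N} w_k \delta_{ki}, \qquad \gamma_{pij} = \sum_{k=1}^{N} w_k \gamma_{kij}, \qquad \nu_{pij} = \sum_{k=1}^{N} w_k \nu_{kij}$$

subject to:

$$\sum_{k=1}^{N} w_k P_k = P$$

$$w_{lk} \le w_k \le w_{uk}$$

where:

$\alpha_{1i}, \alpha_{2ij}, \alpha_{3ij}$ - weights of the importance of different sensitivities assigned by manager;
$n$ - number of risk factors;
$N$ - number of components in a portfolio;
$\delta_{0i}, \gamma_{0ij}, \nu_{0ij}$ - desired sensitivities (delta, gamma and vega) of value of portfolio.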

The objective function measures the total difference between the actual and 
the desired sensitivities of portfolio. The constraints reflect the value of 
portfolio (budget constraint) and the number of instruments of particular 
type which can be included in a portfolio. 
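
How such an objective and its constraints could be evaluated for a candidate portfolio is sketched below. Several points are assumptions made for illustration only: deviations from the desired sensitivities are penalized by weighted squares, each instrument's delta, gamma and vega are flattened into one list, and the instruments, prices and bounds are hypothetical.

```python
def portfolio_sensitivities(holdings, instrument_sensitivities):
    """Holdings-weighted sums of the instruments' sensitivities."""
    m = len(instrument_sensitivities[0])
    return [sum(w * s[j] for w, s in zip(holdings, instrument_sensitivities))
            for j in range(m)]


def objective(holdings, instrument_sensitivities, desired, importance):
    """Weighted squared distance between actual and desired sensitivities."""
    actual = portfolio_sensitivities(holdings, instrument_sensitivities)
    return sum(a * (p - d) ** 2 for a, p, d in zip(importance, actual, desired))


def is_feasible(holdings, prices, budget, lower, upper, tol=1e-9):
    """Budget constraint plus lower and upper bounds on each holding."""
    value = sum(w * p for w, p in zip(holdings, prices))
    within = all(lo <= w <= up for w, lo, up in zip(holdings, lower, upper))
    return abs(value - budget) <= tol and within


# Hypothetical instruments described by [delta, gamma, vega].
SENSITIVITIES = [[1.0, 0.0, 0.0], [0.5, 0.1, 0.2], [-0.5, 0.05, 0.1]]
PRICES = [100.0, 10.0, 8.0]
DESIRED = [0.0, 0.0, 0.0]        # e.g. a fully hedged position
IMPORTANCE = [1.0, 1.0, 1.0]

unhedged = [1.0, 0.0, 0.0]
hedged = [1.0, 0.0, 2.0]

if __name__ == "__main__":
    print(objective(unhedged, SENSITIVITIES, DESIRED, IMPORTANCE))
    print(objective(hedged, SENSITIVITIES, DESIRED, IMPORTANCE))
    print(is_feasible(hedged, PRICES, 116.0, [0.0] * 3, [5.0] * 3))
```

In practice the holdings would be found by a numerical optimizer. The sketch only compares two candidate portfolios, the second of which offsets the delta of the first at the cost of small gamma and vega exposures.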
Abstract: Interval-probability (IP) is a substantial generalization of classical
probability. It allows one to model different aspects of uncertainty adequately
without losing the neat connection to the methodology of classical statistics.
Therefore it provides a well-founded basis for data-based reasoning in the
presence of uncertain knowledge. The paper supports that claim by outlining the
generalization of Neyman-Pearson-tests to IP. After introducing some basics of
the theory of IP according to Weichselberger (1995, 1998) the fundamental
concepts for tests are extended to IP; then the Huber-Strassen-theory is
briefly reviewed in this context and related theorems for general IP are
given. Finally further results are sketched.



1 Introduction 

The appropriateness of probability for modelling uncertain knowledge in
artificial intelligence has more and more often been questioned. The usual
formalization of probability as a non-negative, normalized and ($\sigma$-)additive
set-function - called classical probability in what follows - requires a very
high degree of precision and internal consistency of the information
available; its implicit paradigms are situations which are equivalent to fair
games in idealized gambling situations. There is a large degree of consensus,
however, that uncertain knowledge in artificial intelligence contains aspects
of uncertainty which are quite different from the uncertainty arising in ideal
gambles. Therefore classical probability is judged to be too inflexible to
express all relevant aspects of uncertain knowledge.

Several calculi (like the MYCIN-method or fuzzy logic) emerged as much
more flexible alternatives to classical probability. With the separation from
probability the connection to statistics lost its basis and had to be given up.
Plenty of proposals for data-based, inductive reasoning in the framework
of alternative concepts have been made, but a general methodological
consensus still seems to be out of sight. Often a case-based ad-hoc methodology
is said to be the best one can hope for.^1

^1 “During the 1980s, I came to agree with Piero Bonissone’s view that managing
uncertainty in artificial intelligence means choosing, on a case-by-case basis,
between different formalisms and different ways of using these formalisms.”
Shafer’s foreword to Yager et al. (1994, p. 2).

A different contribution to the methodological debate might be based on
the thesis that the concepts of probability and statistics have been rejected
prematurely. If one succeeds in generalizing the formalization of probability
in a way that provides a calculus,

• which additionally allows for modelling aspects of uncertainty dif- 
ferent from ‘idealized randomness’ as characterizing ideal gambling- 
situations, 

• and keeps a neat connection to classical probability, 

then an appropriate formalism for modelling uncertain knowledge would be 
generated, to which the methodology of classical statistics may be adopted. 
A sound and comprehensive intersubjective framework for statistical rea- 
soning and decision-making in the presence of uncertain knowledge would 
become possible. 

This paper argues in favour of the claim that these aims are fulfilled by the
concept of interval-probability. After a short introduction (§2) to some basic
aspects of the general theory of interval-probability according to
Weichselberger^2, the third section states the alternative-testing problem
in its general form. Then the paper outlines how the connection to classical
probability and statistics can fruitfully be exploited: First, the fundamental
concepts for tests are extended to interval-probability. Second (§4f.),
proper results on the existence and construction of best tests are derived. The
Huber-Strassen-theory is briefly reviewed and interpreted in the context
considered here, and related theorems for general interval-probability
are developed. At the end some selected further results are sketched.



2 Some Basics of Interval-Probability

The concept of interval-probability assigns an interval $[L(A); U(A)]$ - instead
of a single real number $p(A)$ - to each event $A$ as its probability. The width
of the interval introduces a second dimension for modelling uncertainty,
which describes the quality of the information and the amount of knowledge
the probability assignment is based on. The stronger the evidence,
the smaller the width of the interval. The intuitively clear idea of using
interval-valued probability can be rigorously formalized. Weichselberger
(1995, 1998/99) developed a comprehensive theory of interval-probability
based on a general and interpretation-independent axiomatization. This can
be achieved by taking Kolmogorov’s axioms as a basis and extending them in
an appropriate way:

Def. 2.1 Let $(\Omega; \mathcal{A})$ be a measurable space.

• A function $p(\cdot)$ on $\mathcal{A}$ fulfilling the axioms of Kolmogorov is called
K-probability or classical probability; the set of all classical
probabilities on $(\Omega; \mathcal{A})$ will be denoted by $\mathcal{K}(\Omega; \mathcal{A})$.

• A function^3 $P(\cdot)$ on $\mathcal{A}$ is called R-probability with structure $\mathcal{M}$, if

1. $P(\cdot)$ is of the form $P(\cdot): \mathcal{A} \to \mathcal{Z}_0 := \{[L; U] \mid 0 \le L \le U \le 1\}$,
$A \mapsto P(A) = [L(A); U(A)]$.

2. $\mathcal{M} := \{p(\cdot) \in \mathcal{K}(\Omega; \mathcal{A}) \mid L(A) \le p(A) \le U(A),\ \forall A \in \mathcal{A}\} \neq \emptyset$. (1)

• R-probability with structure $\mathcal{M}$ is called F-probability, if

$$\forall A \in \mathcal{A}: \quad \inf_{p(\cdot) \in \mathcal{M}} p(A) = L(A) \ \wedge\ \sup_{p(\cdot) \in \mathcal{M}} p(A) = U(A). \qquad (2)$$

For every F-probability $L(\cdot)$ and $U(\cdot)$ are conjugated, i.e.
$L(A) = 1 - U(\neg A)$, $\forall A \in \mathcal{A}$. Therefore every F-probability is already
uniquely determined either by $L(\cdot)$ or by $U(\cdot)$ alone. Here $L(\cdot)$ will be
used, and $\mathcal{F} = (\Omega; \mathcal{A}; L(\cdot))$ will be called an F-probability-field.

^2 See in particular: Weichselberger (1995, 1998/99).

^3 As in this definition, throughout the paper the capital letter P will be used
for interval-valued assignments and small letters p, q, ... for classical
probability.

The strong relation between interval-probability and non-empty sets of
classical probabilities expressed by the concept of the structure (cf. (1))
proves to be quite important for the theory as a whole. It establishes the
connection to classical probability theory and distinguishes the concept of
interval-probability from simple interval arithmetic, which in most cases
would lead to vacuous results.

Property (2), which is essential for F-probability, is not new in the literature,
but usually it is derived from stronger assumptions, which the theory of
interval-probability treats as special cases. Rather popular is the concept
of C-probability, which provides a superstructure upon the
neighbourhood-models commonly used in robust Neyman-Pearson-testing.^4

Def. 2.2 Let $(\Omega; \mathcal{A})$ be a measurable space. F-probability
$P(\cdot) = [L(\cdot); U(\cdot)]$ is called C-probability, if $L(\cdot)$ is two-monotone, i.e. if:

$$L(A \cup B) + L(A \cap B) \ge L(A) + L(B), \quad \forall A, B \in \mathcal{A}. \qquad (3)$$

Then the F-probability-field $\mathcal{C} = (\Omega; \mathcal{A}; L(\cdot))$ is called a C-probability-field.
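
On a finite space these definitions can be checked by enumeration. The sketch below rests on assumptions made for illustration: the structure is represented by finitely many classical probabilities (its convex hull has the same lower and upper limits, because $p \mapsto p(A)$ is linear), the four-point sample space and all numbers are hypothetical, and a contamination neighbourhood serves as an example of a two-monotone model.

```python
from itertools import combinations

OMEGA = (1, 2, 3, 4)


def events(omega):
    """All subsets of a finite sample space."""
    return [frozenset(c) for r in range(len(omega) + 1)
            for c in combinations(omega, r)]


def prob(p, event):
    return sum(p[w] for w in event)


def envelopes(points, omega):
    """Lower and upper interval limits L(A), U(A) over the given probabilities."""
    lower, upper = {}, {}
    for a in events(omega):
        values = [prob(p, a) for p in points]
        lower[a], upper[a] = min(values), max(values)
    return lower, upper


def is_conjugate(lower, upper, omega, tol=1e-12):
    """Check L(A) = 1 - U(not A) for every event A."""
    full = frozenset(omega)
    return all(abs(lower[a] - (1.0 - upper[full - a])) <= tol for a in lower)


def is_two_monotone(lower, tol=1e-12):
    """Check L(A | B) + L(A & B) >= L(A) + L(B) for all pairs of events."""
    return all(lower[a | b] + lower[a & b] >= lower[a] + lower[b] - tol
               for a in lower for b in lower)


def contamination_points(p0, eps):
    """Extreme points (1 - eps) * p0 + eps * (point mass at w), one per outcome w."""
    return [{v: (1.0 - eps) * p0[v] + (eps if v == w else 0.0) for v in p0}
            for w in p0]


P0 = {1: 0.4, 2: 0.3, 3: 0.2, 4: 0.1}
lower_c, upper_c = envelopes(contamination_points(P0, 0.1), OMEGA)

# Two classical probabilities whose envelope is F- but not C-probability.
NOT_C_POINTS = [{1: 0.0, 2: 0.5, 3: 0.5, 4: 0.0},
                {1: 0.5, 2: 0.0, 3: 0.0, 4: 0.5}]
lower_f, upper_f = envelopes(NOT_C_POINTS, OMEGA)

if __name__ == "__main__":
    print(is_conjugate(lower_c, upper_c, OMEGA), is_two_monotone(lower_c))
    print(is_conjugate(lower_f, upper_f, OMEGA), is_two_monotone(lower_f))
```

In the second model condition (3) fails for A = {1, 2} and B = {1, 3}: the left-hand side equals 0.5, the right-hand side equals 1.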

An important result of Weichselberger’s work is the conclusion that the
condition of F-probability alone is sufficient to derive a neat theory. So
restricting interval-probability to C-probability as a matter of principle is
not necessary. On the contrary, this would be quite unsatisfactory: It can be
shown that the expressive power of the concept of interval-probability is
mainly due to the extension of the calculus to arbitrary
F-probability-fields.^5 The set of all C-probability-fields is too narrow to
model all the intended phenomena. Therefore, when developing a theory of
statistical testing for interval-probability, one should try to allow for
arbitrary F-probability-fields.

^4 Cf. for instance Huber and Strassen (1973) and Section 4 of this paper.
Furthermore, totally-monotone C-probabilities, which are a subclass obeying
further monotonicity-constraints, are the basic concept of the so-called
Dempster-Shafer-Theory (e.g. Yager et al. (1994)).

^5 To mention just two examples (for a detailed argumentation see Augustin
(1998, Chapter 1.2.3)): The generalization of the usual parametric families
(like the normal- or the binomial-distribution) to interval-probability leads
to F- but not C-probability. Also, the set of all C-probability-fields is not
closed under many operations.



3 The Testing-Problem; Optimality-Criteria

As in classical statistics, the first step for a detailed study of tests is the
alternative-problem, where one probability is tested against another one.
Now the hypotheses may consist of F-probabilities:

Problem 3.1 Consider two F-probability-fields $\mathcal{F}_0 = (\Omega; \mathcal{A}; L_0(\cdot))$ and
$\mathcal{F}_1 = (\Omega; \mathcal{A}; L_1(\cdot))$ on a measurable space $(\Omega; \mathcal{A})$ with disjoint
structures $\mathcal{M}_0$ and $\mathcal{M}_1$. Assume further that the set $\{\omega\}$ is measurable
for every $\omega \in \Omega$ and that $\mathcal{M}_0$ and $\mathcal{M}_1$ have a positive distance with
respect to an appropriate metric. After observing a singleton $\{\omega\} =: E$,^6
which has the probability $P_0(E) = [L_0(E); U_0(E)]$ or $P_1(E) = [L_1(E); U_1(E)]$
to occur, an optimal decision has to be made between the two hypotheses
$H_0$: “The ‘true’ probability-field is $\mathcal{F}_0$.” and
$H_1$: “The ‘true’ probability-field is $\mathcal{F}_1$.”

Here, the neat connection to classical probability and statistics shows its
fruitfulness for the first time. The concept of a random variable is still valid
in the new theory of probability. Therefore, just as in classical statistics,
every data-dependent decision between the hypotheses can be described by
a test, i.e. by an ($\mathcal{A}$-$\mathcal{B}$-measurable) function
$\varphi: \Omega \to [0; 1] \subset \mathbb{R}$, where $\varphi(\omega)$ is
the classical probability^7 of rejecting the null hypothesis, if $\{\omega\}$ is
observed. The set of all tests on $(\Omega; \mathcal{A})$ is denoted by $\Phi$.

The development of optimality-criteria can also be based directly on the
methodology of classical statistics. Optimality-criteria will, as usual, depend
on the probabilities of the errors, which now may be interval-valued. To
define them in the frame of interval-probability an appropriate generalization
of expectation is needed, which is obtained straightforwardly by using the
structure: For every F-probability-field $\mathcal{F} = (\Omega; \mathcal{A}; L(\cdot))$ with structure
$\mathcal{M}$ a random variable $X$ on $(\Omega; \mathcal{A})$ is called $\mathcal{M}$-integrable, if $X$ is
$p$-integrable for each element $p(\cdot)$ of $\mathcal{M}$. Then

$$E_{\mathcal{M}} X := [L E_{\mathcal{M}} X;\ U E_{\mathcal{M}} X] := \Big[\inf_{p(\cdot) \in \mathcal{M}} E_p X;\ \sup_{p(\cdot) \in \mathcal{M}} E_p X\Big] \subseteq [-\infty; \infty] \qquad (4)$$

is called (interval-valued) expectation of $X$ (with respect to $\mathcal{F}$). The
application to tests, which are trivially integrable, leads directly to:

Def. 3.2 In Problem 3.1 for every test $\varphi(\cdot) \in \Phi$

$E_{\mathcal{M}_0} \varphi$ is called probability of the error of the first kind of $\varphi$,

$E_{\mathcal{M}_1} \varphi$ is called power of $\varphi$, and

$E_{\mathcal{M}_1} (1 - \varphi)$ is called probability of the error of the second kind of $\varphi$.

^6 For technical reasons the formulation uses sample-size 1. Situations with
sample-size n are included by considering the n-fold independent products of
$\mathcal{F}_0$ and $\mathcal{F}_1$. (For the definition of independent products of
F-probability-fields: Weichselberger (1999, chapter 7); see also: Augustin
(1998, Definition 1.16).)

^7 The concept of randomization is based on an idealized random-experiment;
therefore it makes sense to describe it by classical probability (and not
(necessarily) by interval-probability).

Now the different optimality-criteria from classical statistics can faithfully
and consistently be generalized. In doing so, the extension to the new calculus
requires an additional fundamental decision: Since the probabilities of the
errors may now be interval-valued, one has to choose between several natural
orderings. The easiest one is to produce a linear ordering by judging
the probabilities of the errors according to their upper interval-limits, but
proper interval-orderings can also be formulated, generally leading to partial
or weak orders.^8 Here the Neyman-Pearson point of view is taken and
only the upper interval-limits of the probabilities of error are considered.
Then the Neyman-Pearson postulate “Minimize the probability of the error
of the second kind while controlling the error of the first kind” leads to

Def. 3.3 Consider Problem 3.1, and let a level of significance $\alpha \in (0; 1)$
be given. A test $\varphi^*(\cdot) \in \Phi$ is called a level-$\alpha$-maximin-test (for $\mathcal{F}_0$
against $\mathcal{F}_1$), if $\varphi^*(\cdot)$ respects the level of significance, i.e.
$U E_{\mathcal{M}_0} \varphi^* \le \alpha$, and has maximal power among all tests under
consideration, i.e.

$$\forall \psi \in \Phi \quad \big[\, U E_{\mathcal{M}_0} \psi \le \alpha \ \Longrightarrow\ L E_{\mathcal{M}_1} \psi \le L E_{\mathcal{M}_1} \varphi^* \,\big]. \qquad (5)$$

This is a nonparametric maximin-problem between the structures. The
“parameter” is infinite-dimensional; each of its values is (a one-to-one image
of) a classical probability.
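
For a finite sample space the quantities entering this definition can be computed directly. The following sketch makes several simplifying assumptions that are not part of the theory above: each structure is the convex hull of a few listed classical probabilities (so that lower and upper expectations are attained at the listed points), all numbers are hypothetical, and the search is restricted to non-randomized tests, so the result is in general only a lower bound for the maximin power.

```python
from itertools import product


def expectation(p, phi):
    """Expectation of the test phi under the classical probability p."""
    return sum(pw * fw for pw, fw in zip(p, phi))


def upper_error_first_kind(phi, m0_points):
    """Upper interval-limit of the probability of the error of the first kind."""
    return max(expectation(p, phi) for p in m0_points)


def lower_power(phi, m1_points):
    """Lower interval-limit of the power."""
    return min(expectation(p, phi) for p in m1_points)


def best_nonrandomized_test(m0_points, m1_points, alpha, tol=1e-12):
    """Search all 0/1-valued tests that respect the level alpha."""
    n = len(m0_points[0])
    best, best_power = None, -1.0
    for phi in product((0, 1), repeat=n):
        if upper_error_first_kind(phi, m0_points) <= alpha + tol:
            power = lower_power(phi, m1_points)
            if power > best_power:
                best, best_power = phi, power
    return best, best_power


# Hypothetical structures on a four-point sample space.
M0_POINTS = [(0.5, 0.3, 0.15, 0.05), (0.4, 0.3, 0.2, 0.1)]
M1_POINTS = [(0.05, 0.15, 0.3, 0.5), (0.1, 0.2, 0.3, 0.4)]

if __name__ == "__main__":
    print(best_nonrandomized_test(M0_POINTS, M1_POINTS, alpha=0.1))
```

With these numbers only the test rejecting at the fourth outcome, besides the test that never rejects, keeps the upper error of the first kind at or below 0.1.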



4 Globally Least Favourable Pairs and the 
“Necessity” of C-Probability 

One may try to construct level-$\alpha$-maximin-tests by searching for two
elements $q_0(\cdot)$ and $q_1(\cdot)$ of the structures which are “least favourable”: They
are so hard to discriminate that testing between the whole structures does
not lead to higher probabilities of error than testing between $\{q_0(\cdot)\}$ and
$\{q_1(\cdot)\}$. Then the best test for $\{q_0(\cdot)\}$ against $\{q_1(\cdot)\}$, which is rather easy
to calculate, can be expected to be a level-$\alpha$-maximin-test as well.

Def. 4.1 Consider Problem 3.1. A pair $(q_0(\cdot), q_1(\cdot))$ of K-probabilities is
called a globally least favourable pair (for $\mathcal{F}_0$ against $\mathcal{F}_1$), if

1. $(q_0(\cdot), q_1(\cdot)) \in \mathcal{M}_0 \times \mathcal{M}_1$. (6)

2. There is a version $\pi(\cdot)$ of the likelihood ratio of $q_0(\cdot)$ and $q_1(\cdot)$ with

$$\forall t \ge 0,\ \forall p_0(\cdot) \in \mathcal{M}_0: \quad p_0(\{\omega \mid \pi(\omega) > t\}) \le q_0(\{\omega \mid \pi(\omega) > t\})$$
$$\forall t \ge 0,\ \forall p_1(\cdot) \in \mathcal{M}_1: \quad p_1(\{\omega \mid \pi(\omega) > t\}) \ge q_1(\{\omega \mid \pi(\omega) > t\}) \qquad (7)$$

The heuristics given above can be formally supported.

Prop. 4.2 If $(q_0(\cdot), q_1(\cdot))$ is a globally least favourable pair for $\mathcal{F}_0$ against
$\mathcal{F}_1$, then there exists a best level-$\alpha$ test for $H_0: \{q_0(\cdot)\}$ against
$H_1: \{q_1(\cdot)\}$, which is a level-$\alpha$-maximin-test for $\mathcal{F}_0$ against $\mathcal{F}_1$, too.

^8 For examples of such orders: Weichselberger (1998; Chapter 2.6).

Huber and Strassen managed to state a condition which is sufficient for the
existence of globally least favourable pairs:

Prop. 4.3 (Huber-Strassen-Theorem^9) Consider Problem 3.1 (on a Polish
space $(\Omega; \mathcal{A})$). If $\mathcal{F}_0$ as well as $\mathcal{F}_1$ are C-probability-fields with

$$(A_n)_{n \in \mathbb{N}} \uparrow A,\ A_n \text{ open},\ n \in \mathbb{N} \ \Longrightarrow\ \lim_{n \to \infty} L_i(A_n) = L_i(A), \quad i \in \{0, 1\} \qquad (8)$$

then there exists a globally least favourable pair for $\mathcal{F}_0$ against $\mathcal{F}_1$.

An extension to F-probability has not been considered so far, because the
result below was understood to show the impossibility of a generalization:

Prop. 4.4^10 Consider a finite space $(\Omega; \mathcal{P}(\Omega))$ and an F-probability-field
$\mathcal{F}_0 = (\Omega; \mathcal{P}(\Omega); L(\cdot))$ with structure $\mathcal{M}_0$. If there exists for any
K-probability $p_1(\cdot)$ with $p_1(\cdot) \notin \mathcal{M}_0$ a K-probability $p_0(\cdot) \in \mathcal{M}_0$ in such a
way that $(p_0(\cdot), p_1(\cdot))$ is a globally least favourable pair for $\mathcal{F}_0$ against
$\mathcal{F}_1 = (\Omega; \mathcal{P}(\Omega); p_1(\cdot))$, then $\mathcal{F}_0$ must be a C-probability-field.

The consequence has been an exclusive concentration on models producing
C-probability. Huber (1973; p. 182) himself called Property (3), the
condition of two-monotonicity, the “crucial one to obtain a neat theory”,
and Lembcke entitled his article (Lembcke (1988)), where he introduced
his generalization of Proposition 4.4, “The necessity of [...C-probability] for
Neyman-Pearson minimax tests”. Though - as mentioned in the second
section - this would be rather unsatisfactory from the viewpoint of
modelling, the restriction to C-probability has seemed to be the inevitable
price to pay for testing with interval-probability.



5 Overcoming the Restriction to C-Prob. 

Now a sketch of some considerations from Augustin (1998) is given, where
it is shown that an extension to F-probability is de facto possible; results
analogous to the ones stated above can also be derived for the much wider
concept of interval-probability. Two lines of argumentation proved
successful:

First, the “necessity” stated in Proposition 4.4 can be toned down:
C-probability is necessary to guarantee the existence of a globally least
favourable pair for all possible alternative hypotheses, but the proposition
does not exclude the existence of a globally least favourable pair in one
concrete testing problem where neither $\mathcal{F}_0$ nor $\mathcal{F}_1$ is a C-probability-field.
Furthermore, it can be shown that there are plenty of relevant models where
both hypotheses are not described by C-probability-fields and where
nevertheless globally least favourable pairs exist.^11

^9 Huber and Strassen (1973; p. 257, Theorem 4.1); Buja (1986).

^10 After Huber and Strassen (1973; p. 262, Theorem 7.1) (for finite spaces).
Lembcke (1988; p. 123, Theorem 2.3) extended this proposition to Polish spaces.

^11 Augustin (1998; Chapter 6.1). Important examples for this are certain
generalizations of neighbourhood-models to interval-valued
central-distributions, e.g. contaminated F-normal-distributions.

Second, even in situations where there is no globally least favourable pair,
one need not do without the possibility of a reduction to least favourable
elements of the structure. The concept of globally least favourable pairs can
be modified in a way that the main argument of the proof of Proposition 4.2
remains valid. If the level of significance $\alpha$ is given and fixed (as usual in
Neyman-Pearson-theory), it is only of importance to find K-probabilities
which are least favourable for that concrete level of significance (‘locally’).
This is a much weaker condition, but it will nevertheless prove to be sufficient
for getting a level-$\alpha$-maximin-test between $\mathcal{F}_0$ and $\mathcal{F}_1$.

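Def. 5.1 Consider Problem 3.1, and let a level of significance $\alpha \in (0; 1)$ be
given. A pair $(q_0(\cdot), q_1(\cdot))$ of K-probabilities is called a (level-$\alpha$-)locally least
favourable pair (for $\mathcal{F}_0$ against $\mathcal{F}_1$), if

1. $(q_0(\cdot), q_1(\cdot)) \in \mathcal{M}_0 \times \mathcal{M}_1$. (9)

2. There exists a best test $\varphi^*$ for testing $\{q_0(\cdot)\}$ against $\{q_1(\cdot)\}$ with

$$U E_{\mathcal{M}_0} \varphi^* \le \alpha \ \wedge\ E_{q_1} \varphi^* = L E_{\mathcal{M}_1} \varphi^*. \qquad (10)$$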

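Indeed Proposition 4.2 is - mutatis mutandis - still valid and an analogue
to Proposition 4.3 can be formulated:

Theorem 5.2 If $\mathcal{F}_0$ and $\mathcal{F}_1$ fulfil the condition^13

$$(A_n)_{n \in \mathbb{N}} \uparrow A,\ A_n \in \mathcal{A},\ n \in \mathbb{N} \ \Longrightarrow\ \lim_{n \to \infty} L_i(A_n) = L_i(A), \quad i \in \{0, 1\}, \qquad (11)$$

then there exists a locally least favourable pair for $\mathcal{F}_0$ against $\mathcal{F}_1$.

Another way of reducing the testing problem becomes quite important for
constructing the level-$\alpha$-maximin-test. Often it is much easier not to try
to reduce the problem directly to single K-probabilities but rather to other
parts of the structure. In this context the sets $\mathcal{E}(\mathcal{M}_i)$ of all extreme points
of the structures $\mathcal{M}_i$ play a distinguished role:

Theorem 5.3 Assume $\mathcal{F}_0$ and $\mathcal{F}_1$ to obey Condition (11). Then a
test $\varphi^*$ is a level-$\alpha$-maximin-test for $\mathcal{F}_0$ against $\mathcal{F}_1$, if and only if it is a
level-$\alpha$-maximin-test for $\mathcal{E}(\mathcal{M}_0)$ against $\mathcal{E}(\mathcal{M}_1)$, i.e. if
$\sup_{p(\cdot) \in \mathcal{E}(\mathcal{M}_0)} E_p \varphi^* \le \alpha$, and

$$\forall \psi \in \Phi \quad \Big[ \sup_{p(\cdot) \in \mathcal{E}(\mathcal{M}_0)} E_p \psi \le \alpha \ \Longrightarrow\ \inf_{p(\cdot) \in \mathcal{E}(\mathcal{M}_1)} E_p \psi \le \inf_{p(\cdot) \in \mathcal{E}(\mathcal{M}_1)} E_p \varphi^* \Big]. \qquad (12)$$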

A short remark concerning the proofs of Theorems 5.2 and 5.3: Both theorems
are based on the facts that every structure is convex and that Condition (11) is
sufficient for the compactness (in the weak-star-topology^14 $\mathcal{T}^*$) of the structure.
Theorem 5.2 can be shown by adapting results from Baumann (1968) and Ganssler
(1971), while Theorem 5.3 applies the theorem that - under quite mild
assumptions - a continuous linear functional attains its infimum (and its
supremum) over a compact convex set in one of the extreme points (e.g. Holmes
(1975, p. 74)).

^12 The results in the rest of this subsection are taken from Augustin (1998,
Chapter 3.3).

^13 Note that Condition (11) is a bit stronger than its counterpart (8), because
it refers to arbitrary measurable sets, and not only to the open ones.

^14 $\mathcal{T}^*$ is different from the topology normally considered in probability
theory (e.g. in Billingsley (1968)), which is also called weak-star-topology.



6 Some Further Remarks

Before looking at some selected further results the question has to be dis- 
cussed - the referee is quite right in emphasizing this aspect - whether 
procedures solely based on classical statistics may lead to the same practi- 
cal solutions. Obviously, the ideas underlying interval-probability and for 
instance classical Bayesian concepts differ essentially in their attitude to 
uncertainty: As set out in the introductory sections, interval-probability is 
serious about the conclusion that there are often types of uncertainty which 
can’t be adequately modelled by classical probability. Therefore ‘inside the 
structure’ there is ‘uncertainty in the strict sense’ like in the non-Bayesian 
approaches to decision-making, in Neyman-Pearson confidence intervals and 
so on. A Bayesian would deny this and would be tempted to specify a prior 
distribution for describing the probability of each element of the structure to 
be the ‘true’ one. But which prior should one take, if one has no information? 
Every prior rules out the uncertainty inside the structure and equates the 
whole set of classical probabilities with an averaged one. Therefore, even the 
so-called non-informative priors are quite informative!^^ However, looking at the optimal test $\varphi^*(\cdot)$ based on a least favourable pair $(q_0(\cdot), q_1(\cdot))$, a formal ex-post relation can be established between the procedures based on interval-probability and classical Bayesianism. If one happened to take just one of those priors $\pi_0(\cdot)$ and $\pi_1(\cdot)$ on the structures $\mathcal{M}_0$ and $\mathcal{M}_1$ whose mixing-procedure leads to $q_0(\cdot)$ and $q_1(\cdot)$ respectively, one would obviously get the same test. The difference remains in the way $q_0(\cdot)$ and $q_1(\cdot)$ are understood. While for a Bayesian the $q_i(\cdot)$ are based on the prior knowledge and describe the expected average behaviour of the structures $\mathcal{M}_i$ in any situation, in the present context $q_0(\cdot)$ and $q_1(\cdot)$ are formal substitutes having no meaning as average elements. They represent the interval-probabilities only formally and only in that concrete testing-problem with the particular hypotheses and the concrete level of significance chosen.

To mention some further results on testing with interval-probability^^: On finite sample-spaces Theorem 5.3 proves to provide a universal method for calculating the optimal tests. It allows one to reformulate the level-α-maximin-testing-problem between the structures, which normally consist of uncountably many K-probabilities, as a linear optimization problem with a finite number of restrictions. Theoretically, therefore, the problem can always be solved by means of linear programming. For sound practical application, however, further research is needed to reduce the computational complexity arising. Applying the duality-theory leads to an extension of the Neyman-Pearson-Lemma to the situation considered here and to algorithms for the calculation of least favourable pairs. In the case of infinite sample-spaces, of course, such a universal algorithm cannot be expected. The construction of least favourable pairs and optimal tests has to exploit specific properties of the concrete models used to formulate the hypotheses. The background for most of the results reported here are general theorems about interval-valued expectation. Therefore, it seems promising to apply them to other optimality criteria for tests and to statistical decision-making.

^^Taking a set of priors and a prior on that set would lead to an infinite regress (see in particular Huber (1976, p. 91f.)). For criticism of the ‘Bayesian dogma of ideal precision’ from a Bayesian viewpoint see the book of Walley (1991).

^^Taken from Chapters 4 - 6 of Augustin (1998).
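
The reduction to extreme points in Theorem 5.3 can be illustrated with a minimal sketch on a finite sample space. It assumes that each hypothesis is described by a finite list of extreme points, each of them a classical probability vector; the function names, the three-point sample space and all numbers are hypothetical and not taken from Augustin (1998). The sketch only evaluates the size and the worst-case power of a given randomized test and does not carry out the linear programming step itself.

```python
def expectation(p, phi):
    """Expected value of the randomized test phi under the probability vector p."""
    return sum(p_x * phi_x for p_x, phi_x in zip(p, phi))


def size_and_min_power(phi, extreme_points_h0, extreme_points_h1):
    """Evaluate a test only at the extreme points of both structures.

    size      = largest rejection probability over the extreme points of H0
    min_power = smallest rejection probability over the extreme points of H1
    """
    size = max(expectation(p, phi) for p in extreme_points_h0)
    min_power = min(expectation(p, phi) for p in extreme_points_h1)
    return size, min_power


# Hypothetical extreme points on a sample space with three elements.
E0 = [(0.6, 0.3, 0.1), (0.5, 0.3, 0.2)]
E1 = [(0.2, 0.3, 0.5), (0.1, 0.4, 0.5)]

# Randomized test: reject with probability 0.25 if the third element is observed.
phi = (0.0, 0.0, 0.25)

size, min_power = size_and_min_power(phi, E0, E1)
```

Because the expectation is linear in the probability vector, the maximum and minimum over the full convex structure are attained at extreme points, so a candidate test with `size <= alpha` at the extreme points keeps the level on the whole structure.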

The considerations outlined in this paper should have supported the thesis: Interval-probability is a powerful means of combining classical statistics with a wider concept of uncertainty. Therefore interval-probability promises to be an appropriate answer to the need for a well-founded methodology for data-based reasoning in the presence of uncertain knowledge.
Abstract: Test procedures for the analysis of multivariate problems are introduced which are more powerful than the conventional Hotelling’s $T^2$ for detecting alternatives where the (treatment) effect has the same direction for all observed variables. They are able to analyse incomplete data; versions exist which do not require normally distributed variables; and, for complete data, the number of dependent variables can arbitrarily exceed the number of independent subjects.



1 Introduction 

The evaluation of treatment efficacy, e.g. in clinical trials, often leads to a 
situation in which several endpoints are of (equal) importance for the (clin- 
ical) judgement, for example if repeated measures are obtained to assess 
longitudinal differences between two groups of subjects. The separate statis- 
tical analysis of such multiple endpoints in a comparison between two treat- 
ments results, however, in a multitude of p-values, whereas in many cases 
a single p-value would be preferable. Moreover, one is often concerned with 
equidirected treatment differences, where one is only interested in whether 
a treatment reveals some improvement upon placebo with respect to all re- 
sponse variables. Multivariate two-sample tests provide a solution to the 
latter problem by summarizing the differences between two treatments with 
respect to all endpoints in a global test statistic leading to a single p-value. 
This approach, proposed by O’Brien (1984), turned out to be a considerably more powerful test for detecting restricted alternatives than Hotelling’s $T^2$. A variety of such directional tests has been proposed during the last couple of years (e.g. Lachin (1992), Läuter et al. (1996)). In the following, it will be shown that the basic ideas of the parametric procedures can be used to construct powerful and widely applicable rank tests which even allow mixed observation vectors consisting of both ordinal categorical and quantitative components. Furthermore, all tests can be applied to incomplete data without being confined to the complete observation vectors, a restriction which is common practice in most of the statistical analysis systems (e.g. SAS, SPSS).

The tests have been programmed as a SAS-IML-Macro and can be imple- 
mented as a profitable supplement to existing SAS procedures for parametric 
and nonparametric analysis of unbalanced multivariate data. 
2 A Data Example

In a randomized clinical trial at the Outpatient Anxiety Disorders Unit of the Department of Psychiatry at the University of Göttingen, 45 patients with a diagnosis of panic disorder and agoraphobia (PDA) were assigned to three groups: clomipramine treatment (an antidepressant), regular aerobic exercise (which was assumed to have a beneficial therapeutic effect), and placebo. The following clinical and self-rated measures were taken at baseline (“week 0”) and after 1, 2, 3, 4, 6, 8, and 10 weeks: Hamilton Anxiety Scales (HAM), Bandelow PDA-Scale in an observer-rated version (BPAS-O) and a patient-rated version (BPAS-P), and Clinical Global Impression in a rater-version (CGI). A simple approach for an analysis of such data could be a group-wise comparison of each of the scores above (i.e. the differences to the baseline, supposing the variables can be regarded as quantitative measures) separately for every week, whereby, in regard to the nature of the data, rank methods should be used rather than parametric tests. However, this approach results in no fewer than 28 p-values (4 variables at 7 times) even when comparing only two groups. Apart from the necessity of adjusting the α-level, this multitude of p-values does not yet give the information which is actually needed: Is one of the treatments superior with respect to all scores at all times? Indeed, regarding only the (two-sided) p-values corresponding to the mean difference (exercise minus clomipramine group) of the 7 differences to baseline of BPAS-O (table 1) will hardly give an answer to the question of overall superiority. Even the classical multivariate parametric approaches to problems of this kind, Hotelling’s $T^2$ for example, address the wrong question, because the “quadratic” nature of those tests does not take into account the direction of the treatment differences. Hence, a treatment difference can be “significant” when treatment A is superior to B for some variables or times and B is superior to A for some other measures or times. Here, clearly no decision can be made concerning an overall superiority, even if the test provides a small p-value. On the other hand, Hotelling’s $T^2$ test is lacking in power especially for these alternatives of interest. This problem arises in a variety of clinical trials, above all those with repeated measurement designs. One solution could be to drop the “quadratic” approaches, like Hotelling’s $T^2$, and to construct the test statistics as simple sums of variables (which, of course, have to be standardized in an appropriate manner). This, for data without missing values, was the suggestion of O’Brien (1984), which will be sketched and extended in the following section.
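
As a hedged illustration of the α-adjustment mentioned above, the following sketch applies a Bonferroni correction to the seven two-sided p-values reported for BPAS-O in Table 1. Bonferroni is chosen here only as the simplest adjustment; the text does not name a specific method, and the function name is hypothetical.

```python
# p-values from Table 1 (exercise versus clomipramine, variable BPAS-O), keyed by week.
p_values = {1: 0.53, 2: 0.90, 3: 0.02, 4: 0.15, 6: 0.11, 8: 0.04, 10: 0.72}
alpha = 0.05


def bonferroni_rejections(p_by_label, alpha):
    """Return the labels whose p-value stays below alpha divided by the number of tests."""
    threshold = alpha / len(p_by_label)
    return [label for label, p in p_by_label.items() if p <= threshold]


# Weeks that look "significant" without any adjustment.
unadjusted = [week for week, p in p_values.items() if p <= alpha]

# Weeks that remain after the Bonferroni adjustment for 7 tests.
adjusted = bonferroni_rejections(p_values, alpha)
```

With all 28 p-values of the full analysis the threshold would be even smaller, which is one reason why a single global test is preferable.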



3 Summary Statistics of the O’Brien Type 

3.1 Model and Test Statistics 

Let $X_{ij}^{(k)}$, $i = 1, 2$; $j = 1, \dots, n_i$; $k = 1, \dots, K$, be the $k$th measurement of subject $j$ in group $i$. In section 3.2 some of the $X_{ij}^{(k)}$ may be missing at random (more precise conditions for the missing proportions are stated there). The stochastically independent vectors $X_{ij} = (X_{ij}^{(1)}, \dots, X_{ij}^{(K)})'$ are assumed to have cumulative distribution functions $F_i$ with marginal distribution functions $F_i^{(k)}$, expected values $E(X_{ij}) := \mu_i = (\mu_i^{(1)}, \dots, \mu_i^{(K)})'$, and covariance matrices $\mathrm{Cov}(X_{ij}) =: \Sigma_i$ (not singular) with $\Sigma_1 = \Sigma_2 = \Sigma$. Let $\Gamma$ denote the covariance matrix of the mean difference vector $Y = (Y_k)_{k=1,\dots,K}$ with $Y_k := \bar{X}_{1\cdot}^{(k)} - \bar{X}_{2\cdot}^{(k)} =: \hat{\mu}_1^{(k)} - \hat{\mu}_2^{(k)}$, which is a consistent estimator of the treatment difference $\mu_1 - \mu_2$. A suggestive way of estimating the treatment difference and the corresponding covariance matrices in the case of incomplete data is described below.

week                    1       2       3       4       6       8       10
BPAS-O (mean diff.)    -2.93   -0.20    6.99    4.68    6.30    5.27    3.07
p-value                 0.53    0.90    0.02    0.15    0.11    0.04    0.72

Table 1: Differences (exercise versus clomipramine group) of the means of the differences to baseline (variable BPAS-O) and p-values (Mann-Whitney test).

In order to detect an overall (treatment) superiority, O’Brien (1984) proposed two parametric multivariate statistics for evaluating $H_0: \mu_1 = \mu_2$ vs. $H_1: \mu_1 - \mu_2 > 0 = (0, \dots, 0)'$, where $\mu_i$ represents the mean vector of treatment $i$ ($i = 1, 2$) involving $K$ endpoints, and $>$ means a component-wise comparison. O’Brien’s OLS and GLS test statistics (their names correspond to the ordinary least squares and generalized least squares technique) can be regarded as special cases of

$T_d = \dfrac{d'Y}{(d'\hat{\Gamma}d)^{1/2}}$  (1)

with a weight vector $d$ and $\hat{\Gamma} := \widehat{\mathrm{Cov}}(Y)$, where $d = 1 := (1, \dots, 1)'$ is possible. In O’Brien’s original approach the weight vectors were $d = (\hat{\sigma}_1^{-1}, \dots, \hat{\sigma}_K^{-1})'$ for the OLS test and $d = \hat{\Sigma}^{-1}(\hat{\sigma}_1, \dots, \hat{\sigma}_K)'$ for the GLS test, where $\hat{\sigma}_k$ denotes the usual pooled estimator of the standard deviation of the $k$th endpoint, i.e. the statistics are (standardized) weighted (GLS) and unweighted (OLS) sums of the (standardized) mean treatment differences. In section 3.3 it will be discussed, however, that these pooled estimators should be replaced by the non-pooled versions.
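
Statistic (1) with the OLS weights can be sketched for complete data as follows. This is a minimal illustration and not the SAS/IML macro mentioned in the introduction: the function names and the example data are hypothetical, the covariance of the mean difference vector is assumed to be estimated by $(1/n_1 + 1/n_2)$ times the pooled covariance matrix, and the pooled standard deviations of O’Brien’s original proposal are used as weights.

```python
from math import sqrt


def column_mean(group, a):
    """Mean of endpoint a over all subjects of one group."""
    return sum(x[a] for x in group) / len(group)


def pooled_covariance(g1, g2):
    """Pooled sample covariance matrix of two groups of complete K-vectors."""
    k = len(g1[0])
    s = [[0.0] * k for _ in range(k)]
    for group in (g1, g2):
        m = [column_mean(group, a) for a in range(k)]
        for x in group:
            for a in range(k):
                for b in range(k):
                    s[a][b] += (x[a] - m[a]) * (x[b] - m[b])
    df = len(g1) + len(g2) - 2
    return [[v / df for v in row] for row in s]


def directional_statistic(d, y, gamma):
    """d'y / sqrt(d' gamma d) for a weight vector d."""
    k = len(d)
    numerator = sum(d[a] * y[a] for a in range(k))
    quad = sum(d[a] * gamma[a][b] * d[b] for a in range(k) for b in range(k))
    return numerator / sqrt(quad)


def obrien_ols(g1, g2):
    """OLS-type statistic: weights are the inverse pooled standard deviations."""
    k = len(g1[0])
    s = pooled_covariance(g1, g2)
    scale = 1 / len(g1) + 1 / len(g2)
    gamma = [[scale * v for v in row] for row in s]  # assumed estimate of Cov(Y)
    y = [column_mean(g1, a) - column_mean(g2, a) for a in range(k)]
    d = [1 / sqrt(s[a][a]) for a in range(k)]
    return directional_statistic(d, y, gamma)


# Hypothetical data: three subjects per group, two endpoints.
group1 = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0)]
group2 = [(0.0, 1.0), (1.0, 0.0), (2.0, 2.0)]
t_ols = obrien_ols(group1, group2)
```

For a single endpoint the function reduces to the usual two-sample t statistic, and rescaling an endpoint leaves the value unchanged, which is the purpose of the standardization.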

In order to guard against a common misunderstanding, it should be pointed out that tests based on linear combinations of endpoints of course cannot ensure that $H_1: \mu_1 - \mu_2 > 0$ holds in the situation under consideration; the main feature of the tests discussed here is rather their higher power in the presence of such equidirected alternatives.
3.2 Parameter Estimators for Incomplete Data



If the data vectors are incomplete due to randomly missing observations, an appropriate way of estimating the mean vectors and covariance matrices has to be found. (Of course, techniques for the handling of “informative” missings are desirable, but hard to find in statistical theory.) The simplest method is implemented in most of the statistical analysis software holding the market today (e.g. SAS, SPSS): the complete case analysis, where only the complete vectors are taken into account, a rarely satisfactory technique with regard to the loss of information. Little and Rubin (1987) discussed a variety of methods for the handling of incomplete data, which are, however, confined to parametric models. In a different context Stanish et al. (1978) recommended a simple and effective method of an available case analysis (i.e. analysing all available data without deleting or substituting observations), which turned out to be applicable not only to the situation under consideration but to the nonparametric approach as well, which will be described in the next section.

Let now $n_{ik}$ be the number of nonmissing observations in group $i$ for the $k$th measurement, and let $n_{ikl}$ denote the number of nonmissing observations for the $k$th and $l$th variable simultaneously. It is necessary for the $n_{ik}$ to meet the following asymptotic requirements: (a) $N_k := \sum_i n_{ik} \to \infty$ ($k = 1, \dots, K$), (b) $\exists\, \tau_k$ with $0 < \tau_k < n_{ik}/N_k < 1 - \tau_k < 1$ ($k = 1, \dots, K$). Define

$\bar{X}_{i\cdot}^{(k)} := \dfrac{1}{n_{ik}} \sum_j X_{ij}^{(k)}$ and $\hat{\mu}_i := (\bar{X}_{i\cdot}^{(1)}, \dots, \bar{X}_{i\cdot}^{(K)})'$,

where $\sum$ shall always mean the sum of all nonmissing elements. Let $\Sigma_i$ denote the covariance matrix with elements $\mathrm{Cov}(X_{ij}^{(k)}, X_{ij}^{(l)})$ and $\hat{\Sigma}_i := (\hat{\sigma}_{ikl})_{k,l}$ the estimator with

$\hat{\sigma}_{ikl} = \dfrac{\sum_{j=1}^{n_i} (X_{ij}^{(k)} - \bar{X}_{i\cdot}^{(k)})(X_{ij}^{(l)} - \bar{X}_{i\cdot}^{(l)})}{n_{ikl} - 1}$.  (2)

Then the pooled estimator of $\Sigma$ is

$\hat{\Sigma} = \dfrac{1}{n_1 + n_2 - 2}\left((n_1 - 1)\hat{\Sigma}_1 + (n_2 - 1)\hat{\Sigma}_2\right)$.

A consistent estimator of $\mathrm{Cov}(\bar{X}_{i\cdot}^{(k)}, \bar{X}_{i\cdot}^{(l)})$ is $\dfrac{n_{ikl}}{n_{ik}\,n_{il}}\,\mathrm{Cov}^*(X_{ij}^{(k)}, X_{ij}^{(l)})$, where $\mathrm{Cov}^*$ denotes the conditional sample covariance given in (2), based on the $n_{ikl}$ observations which have data present for both dependent variables. The factor $n_{ikl}/(n_{ik}\,n_{il})$ can be regarded as a missing correction factor with the property of adjusting the conditional covariances in order to obtain a positive semidefinite covariance matrix. This can be shown following the argumentation in Stanish et al. (1978). Consequently, $\hat{\mu}_i$ and $\hat{\Sigma}_i$ constructed in this manner are consistent estimators of $\mu_i$ and $\Sigma_i$, respectively, and make use of all available data.
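
The available-case idea can be sketched for one treatment group as follows. This is a hedged illustration under stated assumptions: missing values are encoded as `None`, the function name and the example data are hypothetical, and the factor $n_{ikl}/(n_{ik}\,n_{il})$ for the covariance of two available-case means is assumed here as the missing correction factor. Every pair of variables needs at least two jointly observed subjects.

```python
def available_case_estimates(group):
    """Available-case means and covariances for one group.

    group: list of K-vectors; None marks a missing value.
    Returns (means, cov, cov_of_means).
    """
    k = len(group[0])
    # n_k[a]: number of nonmissing observations of variable a
    n_k = [sum(1 for x in group if x[a] is not None) for a in range(k)]
    means = [sum(x[a] for x in group if x[a] is not None) / n_k[a]
             for a in range(k)]
    cov = [[0.0] * k for _ in range(k)]
    cov_of_means = [[0.0] * k for _ in range(k)]
    for a in range(k):
        for b in range(k):
            # subjects with both variables observed
            pairs = [(x[a], x[b]) for x in group
                     if x[a] is not None and x[b] is not None]
            n_ab = len(pairs)
            cross = sum((u - means[a]) * (v - means[b]) for u, v in pairs)
            cov[a][b] = cross / (n_ab - 1)
            # assumed correction factor for the covariance of the two means
            cov_of_means[a][b] = n_ab / (n_k[a] * n_k[b]) * cov[a][b]
    return means, cov, cov_of_means


# Hypothetical group with two variables and two incomplete vectors.
patients = [(1.0, 2.0), (2.0, None), (3.0, 6.0), (None, 4.0)]
means, cov, cov_of_means = available_case_estimates(patients)
```

A complete-case analysis of the same data would keep only two of the four vectors, whereas every observed value contributes to the means here.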
3.3 Distribution of the Mean Treatment Difference Vector and the Test Statistics

Assuming randomly missing observations, the application of a multivariate central limit theorem (Rao (1973), p. 127) ensures the asymptotic normality of $Y$, which is a consistent estimator of the treatment difference $\mu_1 - \mu_2$:

$\sqrt{N}\,Y \;\dot{\sim}\; N_K(0, \Gamma)$

with $N = n_1 + n_2$, where the elements of $\Gamma$ for $k, l = 1, \dots, K$ are

$\gamma_{kk} = \sigma_{kk}\,N \left(\dfrac{1}{n_{1k}} + \dfrac{1}{n_{2k}}\right), \quad \gamma_{kl} = \sigma_{kl}\,N \left(\dfrac{n_{1kl}}{n_{1k}\,n_{1l}} + \dfrac{n_{2kl}}{n_{2k}\,n_{2l}}\right)$ for $k < l$.

$\Gamma$ can be consistently estimated by using the elements of $\hat{\Sigma}$. (To avoid confusion, note that $\Gamma$ is reused here with a slightly different meaning.) Using these estimates in the formulas of O’Brien’s OLS and GLS statistic ensures the asymptotic normality of these statistics under $H_0$. It should be noted, however, that even under the condition of normally distributed data, the OLS and GLS statistics are only asymptotically normally distributed when the $X_{ij}^{(k)}$ are standardized by their pooled empirical standard deviations. In practice, this results in an anticonservative behaviour of both tests (GLS more than OLS), which was pointed out by Läuter et al. (1996) and Bregenzer, Lehmacher (1995) in simulation studies and could be shown analytically by Frick (1997). Läuter et al. (1996) proved that for normally distributed data the OLS statistic exactly follows a t distribution with $N - 2$ degrees of freedom when the non-pooled empirical standard deviation is used for standardizing the $X_{ij}^{(k)}$ instead of the pooled version, but it should be noted that the covariance matrix in the denominator of $T_{OLS}$ in (1) is still based on the pooled estimator. Läuter called this test the standardized-sum test. The same idea (using the non-pooled empirical standard deviation for standardizing the variables) applied to the GLS test removes its anticonservatism, too. Another problem of the GLS test, however, persists: The weight vector $d = \hat{\Sigma}^{-1}(\hat{\sigma}_1, \dots, \hat{\sigma}_K)'$ may have negative components, and this has the consequence that alternatives exist in the positive orthant for which the GLS test has no power and therefore is biased.



4 Nonparametric Tests 

4.1 The Nonparametric Model and Effects 

As even a short summary of the underlying nonparametric theory in the case 
of incomplete multivariate data with not necessarily continuous distributions 
would be far beyond the scope of this paper, only the basic ideas shall be 
sketched here (for details refer to Munzel (1996) and Bregenzer (1998)). Let $X_{ij} = (X_{ij}^{(1)}, \dots, X_{ij}^{(K)})'$, $i = 1, 2$; $j = 1, \dots, n_i$, again denote the independent identically distributed random vectors with distribution function $F_i$ and marginal distributions $F_i^{(k)}$. For the moment the vectors shall contain no missings. The basic principle of the nonparametric approach is the formulation of the hypotheses and effects in terms of the distribution functions (introduced by Akritas and Arnold (1994)) and the estimation of the effects by replacing the distribution functions (say) $F$ with their empirical versions (written as $\hat{F}$), which yields sums and means of ranks (when using Wilcoxon scores and ranking per variable). With the weighted mean $H^{(k)}$ of the distribution functions $F_1^{(k)}$ and $F_2^{(k)}$ and score functions $J^{(k)}$, the nonparametric relative (treatment) effect vector $p$ with components $p_i^{(k)} := \int J^{(k)}[H^{(k)}(x)]\,dF_i^{(k)}(x)$ can be formulated, which can be regarded as an extension of the expected values $\int x\,dF_i^{(k)}(x)$ with another integrand function than identity or, for $J^{(k)}$ the identity, as a weighted mean of the proversion probabilities $\int F_{i'}^{(k)}\,dF_i^{(k)}$. The estimation of the vector $p$ is easy, because it can be expressed in terms of the ranks of the observations:

$\hat{p} = \left(\hat{p}_i^{(k)}\right)_{i=1,2;\;k=1,\dots,K}$,

where each $\hat{p}_i^{(k)}$ is computed from the ranks $R_{ij}^{(k)}$, and $R_{ij}^{(k)}$ denotes the rank of $X_{ij}^{(k)}$ among all observations of the $k$th variable.



4.2 Nonparametric Tests of the O’Brien Type 

The asymptotic normality of $\sqrt{N}\,\hat{p}$ can be shown (Munzel (1996)) and hence, with $p_i^{v} := \sum_{k=1}^{K} v_k\,p_i^{(k)}$ with weights $v_k$, a suitable matrix $C$ and $J^{(k)}$ the identity, the asymptotic normality of

$\hat{p}_1^{v} - \hat{p}_2^{v} = C\hat{p}$.  (3)

For the special case $v_k = 1$, $k = 1, \dots, K$, this is the numerator of the nonparametric version of the OLS statistic proposed by O’Brien (1984). As a perhaps surprising fact it can be mentioned that O’Brien only suggested a rank version of the OLS test, completely neglecting the GLS test. Of course, $\hat{p}$ can be used to construct a nonparametric version of Hotelling’s $T^2$ test as well. Writing (3) in another way and using weights $v' = 1'$ and $v' = 1'S^{-1}$ with

$S = \dfrac{1}{N - 2} \sum_{i=1}^{2} \sum_{j=1}^{n_i} (R_{ij} - \bar{R}_{i\cdot})(R_{ij} - \bar{R}_{i\cdot})'$

and $R_{ij} = (R_{ij}^{(1)}, \dots, R_{ij}^{(K)})'$, the rank versions of the OLS and GLS statistics from equation (1),

$T_{OLS(rank)} := \dfrac{1'(\bar{R}_{1\cdot} - \bar{R}_{2\cdot})}{\left((1/n_1 + 1/n_2)\,1'S1\right)^{1/2}}$ and $T_{GLS(rank)} := \dfrac{1'S^{-1}(\bar{R}_{1\cdot} - \bar{R}_{2\cdot})}{\left((1/n_1 + 1/n_2)\,1'S^{-1}1\right)^{1/2}}$,  (4)

are asymptotically normally distributed under $H_0$. Simulations showed that for small samples a t distribution with $N - 2$ df should be used rather than the normal distribution approximation.

Since, as a consequence of the ranking per variable, any standardization as in the parametric case is unnecessary here, the OLS test in the rank version (assuming a t distribution), unlike its parametric version, keeps the nominal α-level and is identical with the rank version of Läuter’s standardized-sum test. However, the GLS rank test again exceeds the nominal α-level, but using a non-pooled covariance estimator (in the following indicated by RL) in the weight vector and the classical pooled version in the denominator will remove the anticonservatism here as well. Since the GLS method produces optimal estimators and consequently leads to optimal tests, one can expect that the GLS test will be more powerful than the OLS test, but, when restricting the view to the versions which keep the nominal α-level, simulations show that this effect can be observed only for large sample sizes ($n_1, n_2 > 30$). In smaller data sets the two tests perform quite comparably.

Brunner and Puri (1996) introduced a variety of multivariate nonparametric procedures in the two-sample problem, based on the idea of formulating nonparametric hypotheses, effects, and test statistics in terms of the distribution functions which was sketched above. But their tests are of a “quadratic” nature (like, for example, Hotelling’s $T^2$ test) and hence are appropriate for detecting any departure from the null hypothesis, whereas the test procedures proposed here are more powerful for detecting directional alternatives.

For incomplete data the technique presented in section 3.2 can be applied without impairment of the asymptotic theory. But here the ranks should be weighted by the number of available observations per variable (i.e., Wilcoxon scores should be used rather than ranks) in order to prevent different missing proportions in the variables from leading to an undesirable weighting in the statistics (4).
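
For complete data, the rank version of the OLS test can be sketched as a two-sample t statistic on the per-subject sums of ranks. This is a minimal illustration under stated assumptions: ranking is done per variable over both groups with midranks for ties, the function names and the example data are hypothetical, and the incomplete-data weighting by Wilcoxon scores is not included.

```python
from math import sqrt


def midranks(values):
    """Ranks 1..n; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = average
        i = j + 1
    return ranks


def rank_sum_ols(g1, g2):
    """t statistic on per-subject rank sums, returned with its degrees of freedom."""
    k = len(g1[0])
    n1, n2 = len(g1), len(g2)
    pooled = list(g1) + list(g2)
    # rank every variable separately over all subjects of both groups
    per_variable = [midranks([x[a] for x in pooled]) for a in range(k)]
    sums = [sum(per_variable[a][j] for a in range(k)) for j in range(n1 + n2)]
    s1, s2 = sums[:n1], sums[n1:]
    m1, m2 = sum(s1) / n1, sum(s2) / n2
    ss = sum((v - m1) ** 2 for v in s1) + sum((v - m2) ** 2 for v in s2)
    pooled_variance = ss / (n1 + n2 - 2)
    t = (m1 - m2) / sqrt(pooled_variance * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


# Hypothetical complete data: three subjects per group, two variables.
group_a = [(5, 50), (6, 70), (7, 60)]
group_b = [(1, 10), (2, 30), (3, 20)]
t_rank, df = rank_sum_ols(group_a, group_b)
```

Because each subject contributes a single rank sum, the number of variables may exceed the number of subjects without affecting the computation.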



5 Applications 

As a consequence of the reduction to a univariate problem, for complete data (then $T_{OLS(rank)}$ and $T_{GLS(rank)}$ are equivalent to a t test statistic composed of the unweighted (OLS) and weighted (GLS) sum of ranks like in formula (3)) the number of dependent variables can arbitrarily exceed the number of independent observations. This can be useful for longitudinal data or extensive repeated measures designs, where the classical multivariate approaches (like Hotelling’s $T^2$) may turn out to be no longer applicable. Moreover, the nonparametric tests allow mixed variable vectors of metrical and (ordinal) categorical components to be analysed in a single step, an opportunity which cannot be offered by the classical parametric multivariate methods. The power benefit of the directional tests relative to Hotelling’s $T^2$ has been demonstrated in simulation studies (Bregenzer, Lehmacher (1995)).

Returning to the data in section 2, one can now further investigate the mean rank differences (exercise - clomipramine) for the variable BPAS-O. Repeated univariate testing was not suitable for detecting overall superiority. Hotelling’s $T^2$ test in the rank version (adjusted for incomplete data as described above) yields p = 0.045, but this is indeed only an indication of any departure from the null hypothesis. O’Brien’s OLS test in the rank version finally gives p = 0.230, i.e. there is no evidence for superiority of one of the two treatments. Consider, as another example but without listing the data, both the BPAS-O and BPAS-P scale for clomipramine versus placebo (here all mean differences have the same sign). Hotelling’s $T^2$ test in the rank version yields p = 0.0551, whereas O’Brien’s OLS test in the rank version gives p = 0.0001, and one could expect that this is a consequence of its higher power for detecting equidirected differences. When simultaneously analysing all scales at all times (resulting in 28 variables: 4 scales with 7 differences to baseline), the comparison of the exercise and the clomipramine group with the OLS (rank) test gives p = 0.022. Here it should be noted that, taking into account the incomplete data, a complete-case analysis as a standard in most of the statistical software systems would have to exclude the substantial portion of 19 of the 45 patients from the analysis, whilst the available-case analysis as presented here allows the whole data set to be used (table 2). A computational realization of multivariate two-sample tests for incomplete data which exploits all available data (i.e. without deleting observations) seems not to be available in the major statistical software packages. To overcome this unsatisfactory situation, a SAS macro has been developed by the author which makes use of the SAS/IML matrix language and includes all test procedures which were outlined here. It can be implemented on every platform with SAS version 6.04 or higher. A free copy of the collection is available on request from the author,
email: bregenzer@uke.uke.uni-hamburg.de





             vectors
treatment    complete    incomplete
pcb          9           6
ex           8           7
med          9           6
Σ            26          19

Table 2: Number of complete and incomplete vectors in the data example
Classification and Positioning of Data Mining Tools

W. Gaul, F. Säuberlich

Universität Karlsruhe, D-76128 Karlsruhe, Germany

Abstract: Various models for the KDD (Knowledge Discovery in Databases) 
process are known, which mainly differ with respect to the number and description 
of process activities. We present a process unification by assigning the single steps 
of these models to five main stages and concentrate on data mining aspects. An 
overview concerning data mining software tools with focus on inbuilt algorithms 
and additional support provided for the main stages of the KDD process is given 
within a classification and positioning framework. Finally, an application of a 
modification of an association rule algorithm is used as empirical example to 
demonstrate what can be expected when data mining tools are used to handle 
large data sets. 



1 Knowledge Discovery in Databases and 
Data Mining 

Because of advances in data storage technology and the explosive growth in the capabilities to generate and collect data, companies are nowadays more and more aware of their “data mining” and “data warehousing” problems (to use these trendy labels for situations well known to the data analysis community for a long time). The possibility of collecting enormous quantities of data (of which in practice often only a relatively small amount is actually used) has led to demands for “easy” data handling, has outpaced the traditional abilities to interpret and digest such data, and has created a need for tools and techniques of automated and intelligent database analysis. The area of Knowledge Discovery in Databases (KDD for short) tries to cope with these problems.

In practice the terms KDD and data mining are often used synonymously, while in research a differentiation of these two terms is usual. KDD denotes the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data (Fayyad et al. (1996)). Thus, the term KDD describes a whole process of identifying patterns in data, whereas data mining is just a step in the KDD process consisting of the application of particular data mining algorithms that, under some acceptable computational efficiency limitations, produce a particular enumeration of patterns (Fayyad et al. (1996)).

In 1989 the term data mining was not mentioned among the database management systems research topics for which it was postulated that they would deserve research attention in the future, whereas in 1993 data mining together with other topics already occupied the second position (Stonebreaker et al. (1993)).

As in practice software offers in the KDD area are often called data mining 
tools, we will also use this term, although most of the facilities in these tools 
try to support several main stages within the KDD process. 



2 KDD Process Models 

Table 1 shows four KDD process models, which reveal a similar structure 
but offer different activities within their process representations. 





Main stages:             Task Analysis, Preprocessing, Data Mining, Postprocessing, Deployment

Brachman, Anand (1996):  Task Discovery, Data Discovery, Data Cleaning, Model Development, Data Analysis, Output Generation
Fayyad et al. (1996):    [...] Preprocessing [...]
Mannila (1997):          [...] Preparing the [...]
Wirth, Reinartz (1996):  Requirement Analysis, Knowledge Acquisition, Preprocessing, Pattern Extraction, Postprocessing, Deployment

Table 1: Main stages of the KDD process



Brachman, Anand (1996) start with task discovery as a step in which requirements with respect to tasks and resulting applications must be engineered. Data discovery and data cleaning activities follow; then, in the model development and data analysis steps, certain data mining techniques have to be selected and applied to the data. Finally, an output generation step is mentioned. Because of page restrictions the description of the process models of Fayyad et al. (1996) and Mannila (1997) is restricted to the terms used for labelling the single process steps in Table 1. In the last row of this table Wirth, Reinartz (1996) are mentioned, who formulate a requirement analysis step at the beginning, in which characteristics, needs and goals of the application are considered, and continue with a knowledge acquisition step, in which availability and relevance of different types of knowledge are determined, before preprocessing, actual pattern extraction, and postprocessing are performed. The label “deployment” for their last step stresses the point that more than just output generation is needed to turn scientific activities into successful applications.

Of course, whenever various descriptions of an underlying phenomenon are available one can look for structural similarities. We propose a unification of the different process representations just mentioned in terms of the following five main stages, also shown in Table 1: task analysis, preprocessing, data mining, postprocessing, and deployment. In the next section, we concentrate on data mining aspects and examine data mining software tools which, despite the generic term, try to support at least the following KDD process main stages: data mining, pre- and postprocessing.



3 Data Mining Software Tools 

Table 2 gives an overview of 16 data mining software tools where we restrict 
ourselves, apart from SIPINA and KnowledgeSeeker, to so-called multi-task 
tools which support different data mining tasks and techniques. Besides 
name of the tool, company and (data mining) techniques supported, the 
platforms on which the software could be operated, price and year of the re- 
lease of the first version are mentioned. The last two columns show, whether 
the software can be used on parallel environments and whether there are 
certain restrictions in the size of the data sets, which can be analysed. The 
survey on which the information contained in Table 2 (and in Table 3 as 
well as in Figure 1) is based was conducted up to April 1998. 

The techniques supported most often are (ranked according to importance) 
decision trees, neural networks, cluster analysis, association rules, k nearest 
neighbour, and regression. 

In Table 3, a subset of 12 of these 16 tools has been examined (selection crite- 
rion was availability of information) with respect to features concerning the 
three main steps — preprocessing, data mining and postprocessing — within 
the KDD process as well as with respect to additional features like visual 
programming, parallel environment, and platform. For the remaining tools 
shown in Table 2 some parts of the information mentioned in Table 3 were 
missing. 

We aggregated the characteristics of Table 3 in a suitable manner and used 
multidimensional scaling (Kruskal MDS and Principal Component Analysis 
together with Property Fitting) and cluster analysis (Single, Complete, and 
Average Linkage as well as McQuitty and Ward) to obtain a solution of four 
segments as shown in Figure 1 (see Gaul, Baier (1994) for details with respect 
to the application of standard positioning and segmentation procedures). 
Of course, clustering would not have been necessary to reveal the structure 
depicted in Figure 1 but it helped to get a feeling for the task of grouping 
“similar” software tools. Finally, it should be mentioned that the Kruskal 
stress for the solution depicted in Figure 1 was very good (stress=0.003) and 
that the best CCC-value was obtained for Average Linkage and McQuitty 
(CCC=0.821). 
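
The grouping step can be illustrated with a small sketch. The following Python fragment is not the procedure used for Figure 1; it assumes hypothetical binary feature profiles (the labels `tool_a` to `tool_d` and all feature values are invented for illustration) and applies average linkage to Hamming distances until a chosen number of segments remains.

```python
from itertools import combinations


def hamming(u, v):
    """Number of positions in which two feature profiles differ."""
    return sum(a != b for a, b in zip(u, v))


def average_linkage(profiles, n_segments):
    """Agglomerative clustering with average linkage.

    profiles: dict mapping a label to a tuple of feature values.
    Returns a sorted list of clusters, each a sorted list of labels.
    """
    clusters = [[name] for name in sorted(profiles)]

    def dist(c1, c2):
        # Average linkage: mean pairwise distance between the members.
        total = sum(hamming(profiles[a], profiles[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_segments:
        # Merge the two clusters with the smallest average distance.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: dist(clusters[p[0]], clusters[p[1]]))
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(clusters)


# Hypothetical profiles (1 = feature present); not taken from Table 3.
toy = {
    "tool_a": (1, 1, 1, 1, 1, 1),
    "tool_b": (1, 1, 1, 1, 1, 0),
    "tool_c": (1, 0, 0, 0, 0, 0),
    "tool_d": (1, 0, 0, 0, 0, 1),
}
segments = average_linkage(toy, 2)
print(segments)
```

With real feature profiles the distance function and the linkage rule would have to be chosen to match the aggregation of characteristics actually used; the sketch only shows the mechanics of merging "similar" tools.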
Name: Clementine
Company: Integral Solutions Ltd., GB
Techniques: Decision Trees: ID3, C4.5; Neural Networks: MLP, Kohonen; Association Rules: Apriori-Alg.; Regression

Name: Darwin
Company: Thinking Machines Corp., USA
Techniques: Decision Trees: CART; Neural Networks: MLP; k Nearest Neighbour

Name: DataEngine
Company: MIT-Management Intelligenter Technologien GmbH, Aachen
Techniques: Decision Trees: C4.5 (Plugin); Neural Networks: MLP, Kohonen, Fuzzy Kohonen; Cluster Analysis: Fuzzy C-Means; k Nearest Neighbour (Plugin)

Name: Enterprise Miner
Name: Intelligent Miner
Company: SAS Institute, USA

Techniques: Regression
Techniques: Decision Trees: C4.5; Association Rules; Cluster Analysis: K-means; K Nearest Neighbour
Techniques: Decision Trees: CART, ChAID; Neural Networks: MLP, RBF; Association Rules; Cluster Analysis; Regression
Techniques: Neural Networks: RBF; Cluster Analysis: K-means; K Nearest Neighbour; Principal Component Analysis
Techniques: Decision Trees: Based on ID3; Neural Networks: MLP, RBF, Kohonen; Association Rules; Cluster Analysis: Propr. algorithm based on distance measure
Techniques: Decision Trees: C4.5; Association Rules; Cluster Analysis: K-means
Techniques: Decision Trees: CART, ChAID

Company: Partek Inc, USA
Techniques: Neural Networks: MLP; Cluster Analysis: K-means; Regression; Correspondence Analysis; Principal Component Analysis

Company: Unica Technologies, Inc, USA
Techniques: Neural Networks: MLP, RBF; Cluster Analysis: K-means; K Nearest Neighbour; Regression

Company: Lab. E.R.I.C., Univ. Lyon, France
Techniques: Decision Trees: CART, Elisee, ID3, C4.5, ChAID, SIPINA

Platform: Unix; Sun SPARC, HP, Digital Alpha Unix; Windows NT
Restriction: n.a.
Restriction: 16.384 attributes, 2^32 - 1 rows

Company: Attar Software, GB
Techniques: Association Rules; Cluster Analysis
Price: from 995 to 9.995 £ (995.- £)
P. E.: No
Restriction: 2.000 rows

(F. V. = First Version; P. E. = Parallel Environment; Apriori-Alg. = Apriori-Algorithm; Propr. = Proprietary; MPP =
massively parallel processing; SMP = symmetric multi processing; n.a. = no answer; MLP = Multilayer Perceptron;
RBF = Radial Basis Functions) Evaluation Date: April 1998



Table 2: Data mining software tools 
Tools such as KnowledgeSeeker and SIPINA, which provide only one data
mining technique (decision trees) and do not offer many pre- and postprocessing
capabilities, are clearly separated from the other tools.

Darwin and Data Mining Tool are examples of tools which combine some
data mining features with more postprocessing capabilities.
(Scale Transform. = Scale Transformation; Autom. Feature Sel. = Automatic Feature Selection; Visual. of Results =
Visualization of Results; Parallel Environm. = Parallel Environment) Evaluation Date: April 1998



Table 3: Features of selected data mining tools 

Neovista Decision Series, Pattern Recognition Workbench, Inspect, and 
Partek provide a medium number of different data mining techniques and 
preprocessing capabilities. 

Clementine, DataEngine, Intelligent Miner, and Enterprise Miner form the
segment which, more than the others, tries to support all the main stages of
the KDD process and offers more data mining techniques than the
competitors under consideration.

The visualization presented shows the positioning of data mining software
tools in a competitive environment and confirms an expected development:
from single-technique software products such as KnowledgeSeeker to multi-task
tools which try to support the whole KDD process in an integrated
environment. A further analysis could incorporate considerations with respect
to application areas or prices.




Evaluation Date: April 1998



Figure 1: Clustering and positioning of data mining tools ($S_k$, $k = 1, \dots, 4$,
is the abbreviation for segment $k$)



4 Application: Generalized Brand Switching 
Analysis via a Modification of Association 
Rules 

Within the standards for data mining it is often mentioned that correspond- 
ing techniques should be able to handle huge “mountains” of so-called item- 
sets. In this regard approaches that use association rules are of interest. 

Association rule algorithms form a class of techniques used in more than half
of the software tools shown in Table 2. Basic association rule algorithms are
capable of handling only a special form of data and can therefore only be used
for a specific class of problems, e.g. market basket analysis. In the following
we give a basic description of situations in which association rules can be
applied and formulate modifications needed for the analysis of consumer
behaviour time series.

A pair $(X, Y)$ of itemsets, e.g., subsets of brands of a product category, can
be viewed as the starting point for building association rules in a given data base
$D$ of itemsets. An association rule uses bounds for the support $s(X \cup Y)$ and
confidence $c(X, Y)$ measures of $X$ and $Y$ to check whether the “association
of $X$ and $Y$” is meaningful (where the support $s(X \cup Y)$ gives the percentage
of itemsets in $D$ which contain the itemset $X \cup Y$ and the confidence
$c(X, Y)$ describes the fact that $c(X, Y)$ percent of the itemsets in $D$ that
contain $X$ also contain $Y$). The task of an association rule algorithm is to
find all association rules which fulfil prescribed bounds for support and
confidence values. Since the number of itemsets which satisfy given bounds can
be very large, corresponding algorithms use special techniques to reduce the
search space. Association rule algorithms are a class of data mining
techniques which can cope with large data sets in a reasonable running time.
An example of such an association rule algorithm is the Apriori Algorithm
by Agrawal et al. (1996). We have used the following modifications for the
analysis of brand switching behaviour:
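
Before turning to the modifications, the two basic measures can be made concrete. The following Python sketch is an illustration only; the function names and the small data base `D` are hypothetical, and the values are returned as fractions rather than percentages.

```python
def support(database, itemset):
    """Fraction of itemsets in the database that contain `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for basket in database if itemset <= frozenset(basket))
    return hits / len(database)


def confidence(database, x, y):
    """Fraction of the itemsets containing x that also contain y."""
    sx = support(database, x)
    if sx == 0:
        raise ValueError("x does not occur in the database")
    return support(database, set(x) | set(y)) / sx


# Hypothetical data base of four itemsets (brands bought together).
D = [{"p", "q"}, {"p", "q", "r"}, {"p"}, {"q", "r"}]
s = support(D, {"p", "q"})       # 2 of the 4 itemsets contain p and q
c = confidence(D, {"p"}, {"q"})  # 2 of the 3 itemsets with p also contain q
```

An association rule would be reported only if both values reach the prescribed bounds; pruning of the search space, as done by the Apriori Algorithm, is not shown here.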

For modelling buying histories with respect to a given set of brands $B =
\{p, q, \dots\}$ let $T = (t_1 \to \dots \to t_j \to \dots \to t_{n(T)})$ denote an indexed individual
buying history, i.e., a sequence of subsequently bought brands $t_j \in B$ where
$n(T)$ counts the number of purchases described by $T$. Note that the same
brand can be bought at different purchase occasions.

We call a subhistory $X$ of $T$ a connected subhistory if there exists an index
$j(X) \in \mathbb{N}_0$ so that $X$ can be written in the form $X = (t_{j(X)+1} \to
\dots \to t_{j(X)+n(X)})$ with $j(X) + n(X) \le n(T)$ and use the symbol $\sqsubset$ to denote
such a connected subhistory. For $X \sqsubset T$ the first and last brand of $X$ is
described by $b(X)$ (beginning of $X$) and $e(X)$ (end of $X$), respectively, and
$l(X)$ $[:= n(X) - 1]$ (length of $X$) counts the pairs of subsequently bought
brands.

If for connected subhistories $X, Y \sqsubset T$ there exist $j(X), j(Y) \in \mathbb{N}_0$ such that
$j(X) \le j(Y)$ and $j(X) + n(X) = j(Y) + k$ with $k = k(X, Y) \in \{1, \dots, n(Y)\}$
we use the symbol $X \sqcup_k Y = (t_{j(X)+1} \to \dots \to t_{j(X)+n(X)} \to \dots \to
t_{j(Y)+n(Y)})$ to denote the so-called k-overlapping composition of $X$ and $Y$.
Note that $X \sqcup_k Y \sqsubset T$.

Some obvious properties are:

$(X \sqcup_k Y) \sqsubset T \Rightarrow b(X \sqcup_k Y) = b(X),\ e(X \sqcup_k Y) = e(Y),$

$l(X \sqcup_k Y) = l(X) + l(Y) - k + 1.$

Additionally, for $X \sqsubset T$, let $m(X, T, l)$ be the number of times that $X$ appears
as connected subhistory of $Z \sqsubset T$ with $j(Z) = 0$ and $l(T) - l(Z) = l$.
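
These definitions can be sketched in a few lines of Python. The sketch is a simplification made for illustration: a buying history is represented as a tuple of brand labels, the composition is formed on the level of brand sequences rather than on index positions within one history, and all function names and the example history are hypothetical.

```python
def occurrences(x, t):
    """Offsets j at which x appears as a connected subhistory of t."""
    n = len(x)
    return [j for j in range(len(t) - n + 1) if tuple(t[j:j + n]) == tuple(x)]


def m(x, t, l):
    """Number of times x appears in t after dropping the last l purchases."""
    z = t[:max(len(t) - l, 0)]  # initial part Z of T, shortened by l purchases
    return len(occurrences(x, z))


def compose(x, y, k):
    """k-overlapping composition: the last k brands of x are the first k of y."""
    if not 1 <= k <= len(y) or k > len(x):
        raise ValueError("k out of range")
    if tuple(x[-k:]) != tuple(y[:k]):
        raise ValueError("x and y do not overlap in k brands")
    return tuple(x) + tuple(y[k:])


def length(x):
    """Number of pairs of subsequently bought brands."""
    return len(x) - 1


# Hypothetical buying history of one individual.
T = ("A", "A", "B", "A", "A", "A")
xy = compose(("B", "E"), ("E", "G"), 1)  # ("B", "E", "G")
```

For the composed history `xy` the length property stated above can be checked directly: `length(xy)` equals `length(x) + length(y) - k + 1`.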

Up to now the buying history of just one individual was used. Now, assume
that $I$ is a (large) set of individuals. Then

$s_l(X) = \sum_{i \in I} m(X, T_i, l)$

counts the occurrence of $X$ in the set

$D_l := \{Z_i \mid Z_i \sqsubset T_i,\ l(T_i) - l(Z_i) = l,\ j(Z_i) = 0,\ i \in I\}$

where $D_0 = \{T_i \mid i \in I\}$ is a given set of individual buying histories that
corresponds with the given data base $D$ of itemsets mentioned in the general
description in the beginning. The value $s_l(X)$ is called $l$-generalized support
of $X$. For $X, Y \sqsubset T$ with $k(X, Y) = 1$

$c(X, Y) = \frac{s_0(X \sqcup_1 Y)}{s_{l(Y)}(X)}$

can be labeled as generalized confidence of $X$ and $Y$ and gives the percentage
of individuals of $I$ that have switched from $X$ to $Y$ (note that $k > 1$
is needed for a generalized version of the Apriori Algorithm). This notation
contains normal conditional switching (see, e.g., Carpenter, Lehmann
(1985)) from a brand $p$ to a brand $q$ as a special case in the following way:
Set $X = (p)$ (with $l(X) = 0$) and $Y = (p \to q)$ (with $l(Y) = 1$), then

$c_{pq} = c((p), (p \to q)) = \frac{\text{number of occurrences of } (p \to q) \text{ in } D_0}{\text{number of occurrences of } (p) \text{ in } D_1}$

describes the entries of the well known conditional switching matrix.
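
A minimal Python sketch of the generalized support and of the generalized confidence for the case k = 1 is given below. It is an illustration under stated assumptions, not the implementation used for the empirical example: histories are tuples of brand labels, and the function names and the three example histories are hypothetical.

```python
def count(x, z):
    """Number of times x appears as a connected subhistory of z."""
    n = len(x)
    return sum(1 for j in range(len(z) - n + 1) if tuple(z[j:j + n]) == tuple(x))


def gen_support(x, histories, l):
    """Occurrences of x in all histories shortened by their last l purchases."""
    return sum(count(x, t[:max(len(t) - l, 0)]) for t in histories)


def gen_confidence(x, y, histories):
    """Generalized confidence for k = 1 (x ends with the brand y starts with)."""
    if x[-1] != y[0]:
        raise ValueError("x must end with the first brand of y")
    composed = tuple(x) + tuple(y[1:])            # 1-overlapping composition
    denom = gen_support(x, histories, len(y) - 1)  # support shortened by l(Y)
    return gen_support(composed, histories, 0) / denom if denom else float("nan")


# Hypothetical buying histories of three individuals.
histories = [("p", "q", "p", "p"), ("p", "p", "q"), ("q", "p")]
c_pq = gen_confidence(("p",), ("p", "q"), histories)  # switching from p to q
c_pp = gen_confidence(("p",), ("p", "p"), histories)  # repeat purchase of p
```

Shortening each history by its last purchase in the denominator ensures that only purchases of `p` that are followed by another purchase are counted, so the entries of one row of the switching matrix sum to one.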



Consider an empirical example where the switching behaviour of 1254 house- 
holds with respect to a product category of 7 brands {A,B,C,D,E,F,G} was 
recorded for a certain time period. The conditional switching matrix as de- 
picted in Table 4 can be computed by “traditional counting” but if one is 
interested in what can be called “higher order associations” then the number 
of compositions of subhistories is rapidly increasing. 



from brand   A        B        C        D        E        F        G
A            0,72784  0,05282  0,02596  0,02417  0,02865  0,02507  0,11549
B            0,05244  0,53165  0,07776  0,06148  0,09132  0,03617  0,14919
C            0,04192  0,14770  0,43114  0,08583  0,08982  0,04790  0,15569
D            0,04560  0,12541  0,07329  0,43811  0,09609  0,07492  0,14658
E            0,03625  0,12875  0,06250  0,08500  0,52375  0,03250  0,13125
F            0,05523  0,10848  0,05128  0,06706  0,05720  0,42998  0,23077
G            0,07503  0,09672  0,05041  0,05862  0,05920  0,06155  0,59848



Table 4: Conditional switching matrix 



Using the methodology just explained, “modified” association rules can be
formulated with the help of subhistories $X$, $Y$, $s_0(X \sqcup_1 Y)$, and $c(X, Y)$ to
get deeper insights into the buying behaviour of individuals based on a sample
of buying histories $D_0$. Table 5 shows selected results that enrich the
information obtainable by traditional conditional switching considerations,
e.g., the first column of Table 5 coincides with the first row of Table 4.
Rule (X, Y)    c(X,Y)   s_0(X ⊔_k Y)   Rule (X, Y)          c(X,Y)   s_0(X ⊔_k Y)   Rule (X, Y)        c(X,Y)   s_0(X ⊔_k Y)
(A), (A->A)    0,72784  813            (A->A), (A->A->A)    0,71215  381            (B), (B->E->B)     0,03812  34
(A), (A->B)    0,05282  59             (A->A->A), (A->A)    0,86788  381            (B->E), (E->B)     0,41975  34
(A), (A->C)    0,02596  29             (E), (E->E->E)       0,36926  233            (B), (B->E->E)     0,03027  27
(A), (A->D)    0,02417  27             (E->E), (E->E)       0,70606  233            (B->E), (E->E)     0,33333  27
(A), (A->E)    0,02865  32             (B->B->B)            0,59726  218            (B), (B->G->B->B)  0,01685  12
(A), (A->F)    0,02507  28             (B->B->B), (B->B)    0,83846  218            (B), (B->E->G)     0,00897  8
(A), (A->G)    0,11549  129            (D->D), (D->D)       0,62326  134            (B->E), (E->G)     0,09877  8



Table 5: Part of the results of the modified Apriori algorithm 



5 Conclusion 

In recent years, numerous so-called data mining software tools have been
introduced into the market. Among the tools for which we obtained information,
decision trees, neural networks, cluster analysis, and association rule
algorithms belong to the data mining techniques supported most often. The
development in this area is directed towards multi-task tools which provide
different techniques and solve tasks from nearly all main stages of the entire
KDD process. Nevertheless, and in contradiction to statements of
some data mining software vendors, the user has to be familiar with most
of the methods and techniques in order to be able to solve his problems and
to interpret the results obtained. We selected the analysis of buying
histories by association rules to stress this point and to show that modifications
of standard descriptions and algorithms could be necessary to solve specific
analysis tasks.
Abstract: The hazard rate has become an important statistical tool in the
methodologic repertoire of modern failure time analysis. Hazard rate estima- 
tion is increasingly being employed in a variety of practical applications. In this 
paper, a brief review on nonparametric kernel methods for estimating the hazard 
rate from censored data is provided and the current software situation regarding 
implementations of this methodology is described. 



1 Introduction 

Statistical methods for analysing time-to-event data play an important role 
in a variety of applications including, for example, medical areas (survival 
analysis, infectious disease epidemiology), industrial settings (reliability test- 
ing), and demographic as well as socioeconomic problems (migration analy- 
sis, life events research). A problem frequently encountered in practice arises 
from the fact that the observed data are usually incomplete with respect to 
the exact event times of interest as the occurrence of a secondary event 
precludes knowledge of the time of occurrence of the primary event under 
investigation. Such a situation may be due to limitations on the length of 
the study, loss to follow-up, occurrence of competing events, or withdrawal 
from the study. Data observed under this setting are usually referred to as
(right-)censored.

The hazard rate is nowadays used extensively as a methodologic tool in this 
kind of application to describe the instantaneous risk of observing the event 
of interest over time. Its comprehensible interpretation has led to widespread 
use in applied work addressing the topics mentioned above. Simultaneously, 
its nonparametric estimation from censored data via kernel methods has 
received considerable attention in the statistical literature, especially under 
the assumptions of the random censorship model (for earlier reviews, cf. 
Singpurwalla and Wong (1983), Padgett (1988)). 



2 Definitions 

Let $Y_1, \dots, Y_n$ denote i.i.d. nonnegative random variables (“failure times”) with
distribution function $F$ and density function $f$, and $C_1, \dots, C_n$ i.i.d. nonnegative
random variables (“censoring times”) which are assumed to be independent
of $Y_i$ for all $i = 1, \dots, n$. Under this setting of the so-called random
censorship model one observes the bivariate sample
$D := ((X_1, \delta_1), \dots, (X_n, \delta_n))'$, where $X_i := \min(Y_i, C_i)$, $\delta_i := 1\{Y_i \le C_i\}$,
$i = 1, \dots, n$. The distribution function of the observed data $D$ is denoted
by $F_D$. The hazard rate is then defined as

$h(t) := \lim_{\Delta t \to 0} \frac{1}{\Delta t} P(t \le Y_i < t + \Delta t \mid Y_i \ge t), \quad t > 0.$

For small $\Delta t$ the product $h(t) \cdot \Delta t$ can be viewed as the conditional probability
of observing the event of interest during the interval $[t, t + \Delta t]$ provided
that it has not been observed until $t$. Hence, the hazard rate $h(t)$ can be
interpreted as the instantaneous risk of observing a failure at time $t$. Equivalently,
the hazard rate can be expressed as $h(t) = f(t)/(1 - F(t))$ to underline
its relationship to the density function.
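
The equivalence of the two expressions can be checked numerically. The following Python sketch is an illustration only; it assumes a Weibull failure time distribution (a choice made here, not in the text), for which the hazard rate is known in closed form, and the function names are hypothetical.

```python
import math


def weibull_density(t, shape, scale):
    """Density f(t) of a Weibull distribution."""
    return (shape / scale) * (t / scale) ** (shape - 1) * math.exp(-(t / scale) ** shape)


def weibull_cdf(t, shape, scale):
    """Distribution function F(t) of a Weibull distribution."""
    return 1.0 - math.exp(-(t / scale) ** shape)


def hazard(t, density, cdf):
    """Hazard rate as the ratio of density and survival function."""
    return density(t) / (1.0 - cdf(t))


def hazard_by_limit(t, cdf, dt=1e-6):
    """Conditional failure probability in [t, t + dt) divided by dt."""
    return (cdf(t + dt) - cdf(t)) / ((1.0 - cdf(t)) * dt)


# Weibull with shape 2 and scale 1: the hazard rate is 2 * t.
f = lambda t: weibull_density(t, 2.0, 1.0)
F = lambda t: weibull_cdf(t, 2.0, 1.0)
h_ratio = hazard(1.5, f, F)
h_limit = hazard_by_limit(1.5, F)
```

Both values are close to 3, the closed-form hazard rate of this distribution at t = 1.5; the small difference in `h_limit` stems from the finite step `dt`.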

The estimation of the hazard rate without assuming any parametric model
for the failure times can be approached employing numerous nonparametric
techniques. Apart from kernel methods, which are studied in greater detail
below, estimation via smoothed histograms, splines, maximum
penalized-likelihood procedures, Fourier series techniques and local
polynomials has been considered. The modern approach of wavelet decomposition
is also conceivable but has not yet been analysed.



3 Kernel Estimators for the Hazard Rate 



Kernel methods as a special type of nonparametric functional estimation 
have their origin in density estimation. Today they are employed in a vari- 
ety of settings beyond density estimation (cf. Prakasa Rao (1983), Wand and 
Jones (1995)). The key idea of the kernel method consists in blurring the em- 
pirical mass given to each observed data point over some interval surrounding 
the data point. The length of this interval is determined by some bandwidth 
function, whereas the form of the distribution over this interval is controlled 
by a so-called kernel function $K(\cdot)$ satisfying the Parzen conditions (Wand
and Jones (1995)). For the uncensored case, Watson & Leadbetter (1964) 
have considered kernel estimators of the hazard rate in detail. The censored 
situation, however, poses additional mathematical problems which have been 
tackled during the last fifteen years. 

The general form of a kernel estimator $\hat h$ for the hazard rate $h$ in the case of
censored observations can be written as

$\hat h(t) = \sum_{i=1}^{n} \frac{\delta_{(i)}}{n - i + 1} \cdot \frac{1}{b_{ni}(D,t)} \, K\!\left(\frac{t - X_{(i)}}{b_{ni}(D,t)}\right), \quad b_{ni}(D,t) > 0, \quad t > 0,$

where $\delta_{(i)}$ denotes the censoring indicator corresponding to the $i$-th order
statistic $X_{(i)}$ and $b_{ni}(D,t)$ represents the bandwidth function (Gefeller and
Michels (1992)).
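
The general form can be sketched in Python as follows. This is a minimal illustration, not the implementation of any particular program: the Epanechnikov kernel, the treatment of ties (failures are placed before censored observations) and the interface of the `bandwidth` argument are assumptions made for this example, and no boundary correction is applied.

```python
def epanechnikov(u):
    """Epanechnikov kernel with support [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def kernel_hazard(t, data, bandwidth, kernel=epanechnikov):
    """Kernel estimate of the hazard rate at time t from censored data.

    data: list of pairs (x, delta) with delta = 1 for an observed failure
          and delta = 0 for a censored observation.
    bandwidth: function (i, t, ordered) -> positive bandwidth, where i is
          the rank (1-based) of the observation in the ordered sample.
    """
    # Order statistics; in case of ties failures precede censored values.
    ordered = sorted(data, key=lambda pair: (pair[0], -pair[1]))
    n = len(ordered)
    total = 0.0
    for i, (x, delta) in enumerate(ordered, start=1):
        b = bandwidth(i, t, ordered)
        total += delta / (n - i + 1) * kernel((t - x) / b) / b
    return total


# Hypothetical sample: failures at 1.0 and 3.0, one censored value at 2.0.
sample = [(3.0, 1), (1.0, 1), (2.0, 0)]
fixed = lambda i, t, ordered: 1.0   # constant bandwidth
h_at_1 = kernel_hazard(1.0, sample, fixed)
h_at_3 = kernel_hazard(3.0, sample, fixed)
```

Passing the bandwidth as a function keeps the sketch close to the general form: a constant, a function of `t` only, or a function of the rank `i` can be supplied without changing the estimator itself.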

The numerous possibilities of specifying the bandwidth function lead to
different properties of the resulting hazard rate estimators, whereas the effect
of the particular kernel function is only marginal (except for the behaviour
at the boundary). All direct kernel estimators of the hazard rate proposed
in the literature can be embedded in the general form of $\hat h$ given above and
can be classified according to the incorporated type of the bandwidth
function as follows ($R(k_n, y)$ denotes the distance from $y$ to its $k_n$-th nearest
neighbour among the observations in $D$, $b_n > 0$, $b_n(t) > 0\ \forall t > 0$):

1. fixed-bandwidth kernel estimator: $b_{ni}(D,t) = b_n$,

2. local-bandwidth kernel estimator: $b_{ni}(D,t) = b_n(t)$,

3. nearest neighbour kernel estimator: $b_{ni}(D,t) = R(k_n, t)$,

4. variable kernel estimator: $b_{ni}(D,t) = R(k_n, X_i)$.

Asymptotic properties of these kernel-type estimators for the hazard rate 
have been extensively investigated in the statistical literature. In the follow- 
ing subsections a brief review of the results of these studies is provided. 

3.1 Fixed-Bandwidth Kernel Estimator 

If $b_{ni}(D,t) = b_n$, the resulting bandwidth is independent of the observed
data $D$ and the point of interest $t$. The hazard rate estimator employing
this constant bandwidth ($\hat h_{fix}$) has been studied extensively in the statistical
literature (cf. Ramlau-Hansen (1983), Tanner and Wong (1983), Yandell
(1983), Diehl and Stute (1988)). It has been shown that for $b_n \to 0$ and
$n b_n \to \infty$ (sometimes a stronger condition of this type is required) the following

asymptotic properties hold:

• $\hat h_{fix}(t)$ is an asymptotically unbiased estimator of $h(t)$.

• $\hat h_{fix}(t)$ is mean square consistent for $h(t)$.

• $\hat h_{fix}(t)$ is asymptotically normally distributed (pointwise).

• The standardized process corresponding to $\hat h_{fix}(t)$ can be approximated
by the convolution of a Brownian bridge with the kernel $K(\cdot)$,
allowing the construction of simultaneous confidence bounds.

• Depending on varying assumptions, different results on the speed of
convergence of $\hat h_{fix}$ can be found in the literature.

Despite these promising asymptotic properties, in practical applications it 
has been consistently observed that a globally constant bandwidth leads 
to undesirable effects whenever the data are not equally distributed over 
the whole range of interest. In regions with many observations, the fixed- 
bandwidth kernel estimator tends to oversmooth, whereas when data are 
sparse, the estimator is usually undersmoothed revealing misleading peaks. 
3.2 Local-Bandwidth Kernel Estimator

Minimizing the asymptotic expression of the mean squared error (MSE)
for $\hat h_{fix}$ at each $t > 0$ with respect to the bandwidth yields the local
MSE-optimal bandwidth function $b_n(\cdot)$ given by

$b_n(t) = \left[ \frac{1}{n} \cdot \frac{V(K)}{B^2(K)} \cdot \frac{h(t)}{(1 - F_D(t)) \cdot [h''(t)]^2} \right]^{1/5}$

with $V(K) := \int K^2(u)\,du$, $B(K) := \int u^2 K(u)\,du$. Defining $b_{ni}(D,t) = b_n(t)$
leads to a kernel estimator of the hazard rate ($\hat h_{loc}$) with varying bandwidths
independent of the observed data. The crucial problem of applying this idea
in practice relates to the construction of $b_n(t)$, as minimizing the MSE at
each $t > 0$ depends on the (unknown) true hazard rate $h$, its second derivative
$h''$, and the distribution function $F_D$. Müller and Wang (1990) have
proposed a solution to this problem which is based on an iterative procedure
employing pilot estimates of the unknown quantities derived from the
data $D$ to yield $b_n(t)$. Under weak assumptions on the asymptotic behaviour
of the pilot estimates they have proved the property of weak convergence of
$\hat h_{loc}$ to a Gaussian process and have derived expressions for bias and variance
of $\hat h_{loc}(t)$. In Müller and Wang (1994) the implementation of this proposal is
illustrated and practical hints for the application of the procedure are given.
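
The fifth-root form of the local MSE-optimal bandwidth can be traced in a few lines. The following sketch assumes the usual leading-order expansions of bias and variance of the fixed-bandwidth estimator with bandwidth $b$; these expansions are not spelled out in the text and are stated here as assumptions, with $V(K) = \int K^2(u)\,du$, $B(K) = \int u^2 K(u)\,du$ and $F_D$ the distribution function of the observed data.

```latex
\begin{align*}
\operatorname{Bias}\,\hat h_{fix}(t) &\approx \tfrac{1}{2}\, b^2\, h''(t)\, B(K), \\
\operatorname{Var}\,\hat h_{fix}(t) &\approx \frac{1}{n b}\,\frac{h(t)}{1 - F_D(t)}\, V(K), \\
\operatorname{MSE}(b) &\approx \tfrac{1}{4}\, b^4\, [h''(t)]^2\, B^2(K)
   + \frac{1}{n b}\,\frac{h(t)\, V(K)}{1 - F_D(t)}, \\
\frac{d}{db}\operatorname{MSE}(b) = 0 \;&\Longrightarrow\;
   b^5 = \frac{1}{n}\,\frac{V(K)}{B^2(K)}\,\frac{h(t)}{(1 - F_D(t))\,[h''(t)]^2}.
\end{align*}
```

The squared bias grows with $b^4$ while the variance falls with $1/b$, so the optimum balances the two terms and shrinks at the rate $n^{-1/5}$.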

3.3 Nearest Neighbour Kernel Estimator 

As another approach to remedy the practical problems of $\hat h_{fix}$ mentioned
above, the nearest neighbour idea (originally proposed in the context of
nonparametric discriminatory analysis) has been advocated for incorporation
into the definition of the bandwidth function, i.e., $b_{ni}(D,t) = b_n(D,t)$
has been defined as the distance from $t$ to its $k_n$-th nearest neighbour among
the observations in $D$. It should be noted that for censored data a
generalization of the usual definition of nearest neighbour distances is required
(Gefeller and Dette (1992)) and that the choice between different proposals
is not unambiguous (Dette and Gefeller (1995)).

The nearest neighbour kernel estimator of the hazard function ($\hat h_{NN}$) has
been shown to converge (almost surely) to $h$ at each point of continuity
of $h$ provided that the nearest neighbour sequence $k_n$ is proportional to
$n^{\alpha}$, $\alpha \in (0.5, 1)$, as shown by Tanner (1983). Other versions of $\hat h_{NN}$ have
been considered by Liu & Van Ryzin (1985) and Cheng (1987). These
authors have proved asymptotic normality and strong consistency of $\hat h_{NN}$.
Mathematical disadvantages of $\hat h_{NN}$ are due to the fact that the bandwidth
function defined by the $k_n$-th nearest neighbour distance is not differentiable
at all $t > 0$ and that the integral from 0 to $\infty$ over $\hat h_{NN}$ is not bounded in
general.
3.4 Variable Kernel Estimator

The mathematical disadvantages of $\hat h_{NN}$ have initiated another proposal to
estimate the hazard function. Analogous to an idea in density estimation,
Tanner & Wong (1984) suggested defining $b_{ni}(D,t) = b_{ni}(D)$ as the distance
between $X_i$ and its $k_n$-th nearest neighbour among the remaining observations.
This definition of the bandwidth function is independent of the point
of interest $t$ and adapts only to the configuration of the observations. Using
this bandwidth function yields the so-called variable kernel estimator of the
hazard function ($\hat h_{var}$) which is differentiable for appropriate kernel functions
$K$ at all $t > 0$ and whose integral is bounded. It should be recognized that,
contrary to $\hat h_{fix}$ and $\hat h_{NN}$, neither the bandwidth of $\hat h_{var}$ is globally constant
(as in $\hat h_{fix}$) nor is the number of observations influencing $\hat h_{var}$ fixed (as in
$\hat h_{NN}$). Schäfer (1985) has tackled the mathematically complicated problem
of deriving asymptotic properties of $\hat h_{var}$. He has shown that $\hat h_{var}$ converges
uniformly to $h$ on each compact interval $[a, b]$ with $0 < a < b$ and $F(b) < 1$.
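
The difference between the two data-adaptive bandwidth functions can be sketched as follows. The Python fragment is a simplification for illustration only: it uses ordinary absolute distances between observed times and therefore ignores the generalization of nearest neighbour distances that is required for censored data; the function names and the example values are hypothetical.

```python
def nn_bandwidth(t, xs, k):
    """Distance from the point t to its k-th nearest observation in xs."""
    return sorted(abs(t - x) for x in xs)[k - 1]


def variable_bandwidth(i, xs, k):
    """Distance from xs[i] to its k-th nearest neighbour among the others."""
    others = [abs(xs[i] - x) for j, x in enumerate(xs) if j != i]
    return sorted(others)[k - 1]


# Hypothetical observed times.
xs = [1.0, 2.0, 4.0, 7.0]
b_point = nn_bandwidth(3.0, xs, 2)             # depends on the point t = 3.0
b_obs = [variable_bandwidth(i, xs, 2) for i in range(len(xs))]  # one per observation
```

The nearest neighbour bandwidth has to be recomputed for every point of interest, whereas the variable bandwidths are computed once per observation and then reused for all points.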

3.5 Modification of the Estimator at the Boundary 

A major general drawback of kernel estimators relates to their bias problems
when estimating near endpoints of the data. This phenomenon (termed
boundary effects in the literature) arises as the support of the kernel
function exceeds the range of the data. In the context of hazard rate estimation
the problem is exacerbated in cases where the hazard rate has high derivatives
near endpoints, which often occurs in practice (e.g. for the typical
bathtub-shaped hazard rates). One published “remedy” to this problem is
to avoid displaying curve estimates in the boundary region. For hazard rates
this advice leads to hiding information which is often, at least for the left
boundary region near zero, of particular interest in applications. Thus, the
idea to change kernel functions near and at endpoints in a way to counter the
bias in this region is especially compelling for the hazard rate setting. This
method of the so-called boundary kernels has been worked out in considerable
theoretical and practical detail during the last two decades. Optimum
boundary kernels can now be rather easily constructed (Müller (1991)). A
detailed comprehensible description of the procedure in the context of hazard
rate estimation has been provided by Müller and Wang (1994).



4 Examples 

Practical applications of the nonparametric methodology for the estimation 
of hazard rates from censored data can be found in a variety of different dis- 
ciplines. An overview of published examples is given by Gefeller and Michels 
(1992). Due to its computational simplicity the fixed-bandwidth kernel esti- 
mator is the most popular type of kernel estimator in applications. Gefeller 
and Michels (1992) also present two worked-out examples: one from the 
medical field comparing the survival experience of two patient groups
undergoing different therapies for gastric cancer (employing the variable kernel
estimator), and another illustration of the methodology using data on the 
duration of stay of Spanish foreign workers in Germany before returning 
to their native country (employing the fixed-bandwidth kernel estimator). 
Michels and Gefeller (1994) provide an example from marketing research. 
They analyse data on the duration of membership in a credit card company 
after a member acquisition mailing campaign. In this context, the hazard 
rate describes the instantaneous risk of cancellation after a certain period of 
membership. 



5 Computational Aspects 

5.1 Bandwidth Selection in Practice 

The crucial point in the practical realization of the method lies in the particular
choice of the smoothing parameter (the constant bandwidth $b_n$ in $\hat h_{fix}$,
the number of nearest neighbours $k_n$ involved in the calculation of $\hat h_{NN}$ and
$\hat h_{var}$, and the bandwidth function $b_n(\cdot)$ in $\hat h_{loc}$). Intuitively, if the bandwidth
is too small, there is too much “variance” in the sense that structures belong- 
ing only to the particular data at hand, and not to the underlying hazard 
rate, may be seen in the estimate. On the other hand, if the bandwidth is 
too large, there is too much “bias” in the sense that essential structures of 
the hazard rate are smoothed away. Finding the proper balance between 
too much “variance” and too much “bias” constitutes the difficult task of 
bandwidth selection. 

Apart from a number of explorative methods employing some more or less
formalized trial-and-error procedure to arrive at a suitable selection of the
smoothing parameter, plug-in procedures and cross-validation techniques
have been advocated as an objective way of determining an appropriate
bandwidth. In particular, the so-called modified likelihood criterion aiming at
minimizing the Kullback-Leibler information loss function
$E[-\int h(t) \cdot \log(\hat h(t))\,dt]$ has been studied in the context of hazard rate
smoothing (Tanner and Wong (1984)). However, the quadratic loss function
$E[\int (\hat h(t) - h(t))^2\,dt]$ can also be used to obtain values for the bandwidth
parameters by applying the cross-validation idea to this setting (Sarda and
Vieu (1991)). It should be noted that application of these methods for
selecting smoothing parameters requires an enormous computational effort.

5.2 Software 

Although kernel methods to estimate the hazard rate from censored data 
enjoy growing popularity, the methodology has surprisingly not been im- 
plemented in any of the large standard packages for statistical analyses. 
Specialized software systems like, for example, XploRe offer only the
opportunity to apply kernel methods to the problem of nonparametric estimation
of regression curves, but no package has been developed covering hazard
rate estimation. To overcome this unsatisfactory present situation regarding
software (un)availability, a comfortable menu-driven program termed
KEHaF2e (Kernel Estimation of Hazard Functions, second enhanced ver-
sion) has been developed. The software KEHaF2e, written in C and oper- 
ating under all WINDOWS and DOS environments, computes and displays 
fixed-bandwidth, nearest neighbour and variable kernel estimators of hazard 
rates and density functions. It offers the opportunity to select a kernel func- 
tion (with and without boundary modification) among several candidates. 
Furthermore, it supports an explorative and interactive procedure for the 
selection of smoothing parameters, where graphical displays serve the user 
as a guide to a suitable choice. 



6 Final Remarks 

In this paper kernel-type estimators of the hazard rate constructed by a 
direct smoothing of the observed data have been briefly reviewed. Due to 
their computational simplicity, encouraging asymptotic properties and good 
practical performance, kernel methods are in widespread use in practical ap- 
plications. The only obstacle for an even wider utilization results from a lack 
of appropriate software with respect to an easy-to-use access to this method- 
ology. The program KEHaF2e closes this gap in nonparametric functional 
estimation in failure time analysis. Program and documentation are avail- 
able free of charge (for non-commercial users) from the author on written 
request (please provide a formatted diskette when requesting KEHaF2e). 
Abstract: In the framework of estimating finite mixture distributions we
consider a sequential learning scheme which is equivalent to the EM algorithm in
case of a repeatedly applied finite set of observations. A typical feature of the 
sequential version of the EM algorithm is a periodical substitution of the esti- 
mated parameters. The different computational aspects of the considered scheme 
are illustrated by means of artificial data randomly generated from a multivariate 
Bernoulli distribution. 



1 Introduction 

Problems of sequentially estimating finite mixture distributions arise
routinely in the fields of pattern recognition and signal detection, frequently in
the context of unsupervised learning and neural networks. The observations
are assumed to be received sequentially, one at a time and the estimates of 
parameters have to be updated after each observation without storing the 
observed data. Recently (cf. Grim (1996), Vajda, Grim (1997)) we conside- 
red a probabilistic approach to neural networks based on finite mixtures and 
the EM algorithm. In this case the existence of a sequential version of the 
EM procedure is an important condition of neurophysiological plausibility. 
Sequential methods of estimating finite mixtures have been considered by 
many authors (cf. Titterington et al. (1985), Chapter 6 for a detailed discus- 
sion). However, in most cases the problem is not formulated in full generality 
and/or the solution is computationally intractable for multivariate mixtures. 
Also the methods are usually related to stochastic approximation techniques 
and therefore the important monotonic property of the EM algorithm is lost. 
In the present paper we consider a sequential scheme which is equivalent 
to the EM algorithm in case of a repeatedly applied finite set of observa- 
tions. As the equivalence does not apply for nonperiodical sequences of 
data we use the term pseudo-sequential EM algorithm. A typical feature of 
this scheme is a periodical substitution of the estimated parameters. The 
updated parameters are not substituted into the estimated mixture immedi- 
ately after each observation but only periodically after the last data vector 
of the training set. The considered pseudo-sequential procedure suggests
some possibilities to speed up the EM algorithm and, at the same time,
offers a natural way to extend the pseudo-sequential scheme to infinite
sequences of observations. Different computational aspects of the present
paper are illustrated by means of artificial multivariate binary data.



2 EM Algorithm 

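Let $x = (x_1, \dots, x_N)$ be a vector of binary variables, $x \in \{0, 1\}^N$, and let $P(x)$
be a finite mixture of multivariate Bernoulli distributions

$$P(x) = \sum_{m \in \mathcal{M}} f(m) F(x|m), \qquad F(x|m) = \prod_{n \in \mathcal{N}} f_n(x_n|m), \tag{1}$$

$$f_n(x_n|m) = \theta_{mn}^{x_n} (1 - \theta_{mn})^{1 - x_n}, \qquad \sum_{m \in \mathcal{M}} f(m) = 1,$$

$$\mathcal{M} = \{1, 2, \dots, M\}, \qquad \mathcal{N} = \{1, 2, \dots, N\}.$$

Here $f(m) > 0$ is the a priori weight of the $m$-th component and $f_n(x_n|m)$
are the related discrete distributions of the binary random variables.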

The EM algorithm can be used to compute maximum-likelihood estimates 
of the involved parameters (cf. Dempster et al. (1977), Grim, (1982)). We 
assume that there is a set S of independent observations of a binary random 
vector 

$$S = \{x_1, x_2, \dots, x_K\}, \qquad x_k \in \{0, 1\}^N \tag{2}$$

with some unknown distribution of the form (1). The corresponding log- 
likelihood function 



$$L_S = \frac{1}{|S|} \sum_{x \in S} \log \left[ \sum_{m=1}^{M} f(m) F(x|m) \right] \tag{3}$$

can be maximized with respect to the unknown parameters by means of the 
following EM iteration equations: 



$$f(m|x) = \frac{f(m) F(x|m)}{\sum_{j \in \mathcal{M}} f(j) F(x|j)}, \qquad m \in \mathcal{M}, \; x \in S, \tag{4}$$

$$f'(m) = \frac{1}{|S|} \sum_{x \in S} f(m|x), \qquad m \in \mathcal{M}, \tag{5}$$

$$f'_n(\xi|m) = \frac{1}{\sum_{x \in S} f(m|x)} \sum_{x \in S} \delta(\xi, x_n) f(m|x), \qquad \xi \in \{0, 1\}, \; n \in \mathcal{N}. \tag{6}$$

Here $f'$, $f'_n$ are the new values of the parameters and $\delta(\xi, x_n)$ denotes the delta
function. The EM algorithm produces a nondecreasing sequence of values
of the log-likelihood function converging to a local or global maximum of
$L_S$. The proof of the convergence properties is largely based on the following
inequality, first proved by Schlesinger (1968), for successive values $L_S$, $L'_S$:






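$$L'_S - L_S = \frac{1}{|S|} \sum_{x \in S} \sum_{m \in \mathcal{M}} f(m|x) \log \frac{f(m|x)}{f'(m|x)} + \sum_{m \in \mathcal{M}} f'(m) \log \frac{f'(m)}{f(m)} + \sum_{m \in \mathcal{M}} f'(m) \sum_{n \in \mathcal{N}} \sum_{\xi \in \{0, 1\}} f'_n(\xi|m) \log \frac{f'_n(\xi|m)}{f_n(\xi|m)} \geq 0. \tag{7}$$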

For a detailed discussion of different aspects of convergence see e.g. Demp- 
ster et al. (1977), Grim (1982), Wu (1983), Titterington et al. (1985). 

In the following sections we illustrate different computational aspects of the 
considered procedures by means of artificial data randomly generated from a 
16-dimensional Bernoulli distribution. The parameters of the source mixture 
having three components were chosen randomly from suitably defined inter- 
vals (cf. Grim (1983)). In order to avoid any small sample effects we used 
a sufficiently large data set of 10000 binary vectors. The lower five-tuple 
of curves in Fig. 1 shows typical convergence curves of the EM algorithm
starting from five different randomly chosen points. For the sake of comparison,
the same sets of initial parameters have been used in all computational
experiments. 
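
The iteration (4) - (6) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the program used for the experiments reported here: the names `w` (for $f(m)$) and `theta` (for $f_n(1|m)$), the sample size, the parameter intervals and the random seed are all chosen for this sketch only. It generates artificial 16-dimensional binary data from a three-component mixture and records the criterion (3) after every iteration.

```python
import numpy as np

def log_component_densities(X, theta):
    """log F(x|m) for every row x of X and every component m; shape (K, M)."""
    return X @ np.log(theta).T + (1.0 - X) @ np.log(1.0 - theta).T

def criterion(X, w, theta):
    """Mean log-likelihood L_S of Eq. (3)."""
    log_joint = log_component_densities(X, theta) + np.log(w)
    return float(np.mean(np.logaddexp.reduce(log_joint, axis=1)))

def em_step(X, w, theta, eps=1e-12):
    """One EM iteration, Eqs. (4) - (6)."""
    log_joint = log_component_densities(X, theta) + np.log(w)
    log_post = log_joint - np.logaddexp.reduce(log_joint, axis=1, keepdims=True)
    post = np.exp(log_post)                               # f(m|x), Eq. (4)
    w_new = post.mean(axis=0)                             # f'(m), Eq. (5)
    theta_new = (post.T @ X) / post.sum(axis=0)[:, None]  # f'_n(1|m), Eq. (6)
    return w_new, np.clip(theta_new, eps, 1.0 - eps)      # keep the logs finite

# Artificial data (assumed sizes and intervals, for illustration only).
rng = np.random.default_rng(0)
M, N, K = 3, 16, 2000
true_w = np.array([0.5, 0.3, 0.2])
true_theta = rng.uniform(0.2, 0.8, size=(M, N))
labels = rng.choice(M, size=K, p=true_w)
X = (rng.random((K, N)) < true_theta[labels]).astype(float)

# Randomly chosen starting point, then 30 EM iterations.
w = np.full(M, 1.0 / M)
theta = rng.uniform(0.3, 0.7, size=(M, N))
history = [criterion(X, w, theta)]
for _ in range(30):
    w, theta = em_step(X, w, theta)
    history.append(criterion(X, w, theta))
```

Plotting `history` against the iteration index gives a convergence curve of the kind discussed above; by (7) the recorded values should not decrease, apart from rounding error.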



3 Pseudo-Sequential Equivalent of the EM Algorithm

We create an infinite data sequence by repeating the finite set (2): 

$$x^{(t)} = x_k \in S, \qquad k = (t \bmod K) + 1. \tag{8}$$

It is easily verified that the EM algorithm (4) - (6) can be equivalently 
rewritten as follows: 



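$$f(m|x^{(t)}) = \frac{f(m) F(x^{(t)}|m)}{\sum_{j \in \mathcal{M}} f(j) F(x^{(t)}|j)}, \qquad m \in \mathcal{M}, \; t = 0, 1, 2, \dots \tag{9}$$

$$f^{(t+1)}(m) = \left(1 - \frac{1}{k}\right) f^{(t)}(m) + \frac{1}{k} f(m|x^{(t)}), \qquad f^{(0)}(m) = 0, \tag{10}$$

$$f_n^{(t+1)}(\xi|m) = (1 - \gamma) f_n^{(t)}(\xi|m) + \gamma \, \delta(\xi, x_n^{(t)}), \qquad \gamma = \frac{f(m|x^{(t)})}{k \, f^{(t+1)}(m)}, \qquad f_n^{(0)}(\xi|m) = 0, \tag{11}$$

$$f'(m) = f^{(K)}(m), \qquad f'_n(\xi|m) = f_n^{(K)}(\xi|m), \qquad \xi \in \{0, 1\}, \tag{12}$$

with $k = (t \bmod K) + 1$ as in (8). As expressed by Eqs. (12), the updated parameters $f^{(t)}(m)$, $f_n^{(t)}(\xi|m)$ are not
substituted into $f'(m)$, $f'_n(\xi|m)$ immediately after each observation but only
periodically, at the end of each cycle, i.e. for $t = K$.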







It should be emphasized that the EM algorithm and its sequential version 
(9) - (12) are equivalent in the sense that they produce identical sequences 
of parameters for identical initial values. Obviously, the important mono- 
tonic property (7) and all the well known convergence properties of the EM 
algorithm remain valid for the sequential scheme (9) - (12). Nevertheless, 
we use the term “pseudo-sequential” because the equivalence does not apply 
to data sequences which are not periodical. Fig. 2 illustrates the nonmono- 
tonic behaviour of the pseudo-sequential EM algorithm when it is applied 
to a nonperiodical sequence of data. In Sec. 6 we suggest a truly-sequential
version of the EM algorithm, but its justification is only heuristic.

Note that the initial values $f^{(0)}(m)$, $f_n^{(0)}(\xi|m)$ are irrelevant since, for $t$ being
a multiple of $K$, the first term on the right-hand side of Eqs. (10), (11)
is zero. Let us recall also that the sequential procedure is invariant with
respect to the order of data vectors between substitutions (cf. (5), (6)).
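
The equivalence of one substitution period with one EM iteration can be checked numerically. The sketch below assumes the running-mean form of the recursions, i.e. a step size $1/k$ for the weights and $f(m|x^{(t)}) / (k f^{(t+1)}(m))$ for the component parameters, where $k$ counts the observations since the last substitution. The function names and the random test data are hypothetical and serve only this illustration.

```python
import numpy as np

def posteriors(X, w, theta):
    """f(m|x) for each row of X under the current parameters, Eqs. (4)/(9)."""
    log_joint = X @ np.log(theta).T + (1.0 - X) @ np.log(1.0 - theta).T + np.log(w)
    log_joint -= np.logaddexp.reduce(log_joint, axis=1, keepdims=True)
    return np.exp(log_joint)

def batch_em_step(X, w, theta):
    """One iteration of the EM algorithm, Eqs. (5), (6)."""
    post = posteriors(X, w, theta)
    return post.mean(axis=0), (post.T @ X) / post.sum(axis=0)[:, None]

def pseudo_sequential_cycle(X, w, theta):
    """One substitution period; w and theta stay frozen during the cycle."""
    M, N = theta.shape
    w_run = np.zeros(M)         # running f^(t)(m); the initial value is irrelevant
    th_run = np.zeros((M, N))   # running f_n^(t)(1|m)
    for k, x in enumerate(X, start=1):
        post = posteriors(x[None, :], w, theta)[0]   # weights from the old parameters
        w_next = (1.0 - 1.0 / k) * w_run + post / k  # running mean of f(m|x)
        gamma = post / (k * w_next)                  # assumed step size of Eq. (11)
        th_run = (1.0 - gamma)[:, None] * th_run + gamma[:, None] * x[None, :]
        w_run = w_next
    return w_run, th_run        # substituted only now, at the end of the cycle

rng = np.random.default_rng(1)
M, N, K = 3, 8, 500
X = (rng.random((K, N)) < 0.5).astype(float)
w = np.array([0.2, 0.3, 0.5])
theta = rng.uniform(0.2, 0.8, size=(M, N))

w_batch, th_batch = batch_em_step(X, w, theta)
w_seq, th_seq = pseudo_sequential_cycle(X, w, theta)
equivalent = np.allclose(w_batch, w_seq) and np.allclose(th_batch, th_seq)

# Invariance with respect to the order of the data within the cycle.
w_perm, th_perm = pseudo_sequential_cycle(X[rng.permutation(K)], w, theta)
order_invariant = np.allclose(w_seq, w_perm) and np.allclose(th_seq, th_perm)
```

Because the posterior weights are computed from the frozen parameters `w`, `theta` throughout the cycle, the running averages reproduce the batch sums exactly; updating `w` and `theta` inside the loop would break this equivalence.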
Remark. The periodical substitution of parameters can be interpreted from 
a neurophysiological point of view. It is generally assumed that the adap- 
tivity of neurons is based on some relatively slow biochemical processes. 
For this reason the functional properties of neurons cannot be expected to 
change continuously, as an immediate consequence of a specific activity of 
neurons. We can rather assume that the functioning of a neural network specifically
influences e.g. the concentration of some chemical substances or the energetic
balance of neurons and, in this way, some metabolic changes or growth
processes responsible for adaptation can be initiated. Consequently, some
delay would occur between a specific activity of a neuron and its adaptive 
changes. In this sense, periodical substitutions could correspond to sleep 
phases or to daily cycles. The invariance of adaptive changes with respect 
to data ordering is also a relevant argument for the present interpretation. 



4 Truncated Iteration Cycle 



Motivated by the sequential EM algorithm (9) - (12) we consider first some 
possibilities to speed up the convergence of the EM algorithm. As can
be assumed in the case of randomly chosen starting points, the estimated parameters
usually change substantially in the initial phases of computation. In
other words, in the initial iterations the computed estimates are “handicapped”
by their previous inaccurate values influencing the weights (9). For obvious
reasons this handicap cannot be fully removed by using a larger data set but,
on the other hand, it could be possible to save computing time by using only 
some smaller portion of data when computing the initial “rough” estimates. 
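Proceeding along this line we assume a partition of the sequence $S$ into two
parts $S = S_1 \cup S_2$ and define the parameters $f^{(i)}(m)$, $f_n^{(i)}(\xi|m)$ in analogy with
(5), (6) for $S = S_i$, $i = 1, 2$. We are interested in guaranteeing the monotonic
condition (7) by the parameters $f^{(1)}(m)$, $f_n^{(1)}(\xi|m)$ based on a subset $S_1 \subset S$
since, in this way, we could save computing time. We can write (cf. (7))

$$L'_S - L_S \geq \frac{|S_1|}{|S|} \sum_{m \in \mathcal{M}} f^{(1)}(m) \log \frac{f'(m)}{f(m)} + \frac{|S_2|}{|S|} \sum_{m \in \mathcal{M}} f^{(2)}(m) \log \frac{f'(m)}{f(m)} + \sum_{m \in \mathcal{M}} \sum_{n \in \mathcal{N}} \left[ \frac{|S_1| f^{(1)}(m)}{|S|} \sum_{\xi \in \{0, 1\}} f_n^{(1)}(\xi|m) \log \frac{f'_n(\xi|m)}{f_n(\xi|m)} + \frac{|S_2| f^{(2)}(m)}{|S|} \sum_{\xi \in \{0, 1\}} f_n^{(2)}(\xi|m) \log \frac{f'_n(\xi|m)}{f_n(\xi|m)} \right] \tag{13}$$

and further, using the notation

$$I(f'(\cdot), f(\cdot)) = \sum_{m \in \mathcal{M}} f'(m) \log \frac{f'(m)}{f(m)}, \tag{14}$$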



we can write the inequality 



-L5>^/(/W(-), /(•)) + 



I«52| . „ 



/(m) 



+ E E { 



I<?il 

\Sr\ + \S2\/mm) 



HfnK-\m),f„{-\m))+ 



(15) 



+ 



I.S2I . 

|‘52| + |<5l| * /W(m) «6 {o!i} 



{log 






If the right-hand side of the last inequality is positive then the increase of
the criterion $L_S$ is guaranteed without including the remaining data $S_2$ in the
computation. This condition can be used to choose the minimum necessary
size of the sequence $S_1$ since all the involved quantities are available at each
step of the sequential process. As a result we would obtain a so-called generalized
EM algorithm (cf. Dempster et al. (1977)) with similar properties.
Unfortunately, the lower bound obtained in (15) is probably too rough, since
in our numerical experiments we achieved only small savings in the initial phases
of computation.



5 Intermediate Updating of Parameters 

Another way to speed up the convergence of the EM algorithm is to uti- 
lize the information accumulating sequentially in the parameters (10), (11) 
before the end of the substitution period. In particular, we assume a finite 
partition 

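$$S = \bigcup_{i=1}^{I} S_i, \qquad N_i = \sum_{l=1}^{i} |S_l|, \qquad (N_I = |S|) \tag{16}$$

and define the intermediate updates of the estimated parameters

$$f^{(i)}(m|x) = \frac{f^{(i)}(m) F^{(i)}(x|m)}{\sum_{j \in \mathcal{M}} f^{(i)}(j) F^{(i)}(x|j)}, \qquad F^{(i)}(x|m) = \prod_{n \in \mathcal{N}} f_n^{(i)}(x_n|m), \tag{17}$$

$$f^{(i)}(m) = \frac{1}{N_i} \sum_{l=1}^{i} \sum_{x \in S_l} f^{(l-1)}(m|x), \qquad f^{(0)}(\cdot) = f(\cdot), \tag{18}$$

$$f_n^{(i)}(\xi|m) = \frac{1}{N_i f^{(i)}(m)} \sum_{l=1}^{i} \sum_{x \in S_l} \delta(\xi, x_n) f^{(l-1)}(m|x), \qquad \xi \in \{0, 1\}. \tag{19}$$

At the end of the substitution period ($i = I$) the estimated parameters are
given by

$$f'(m) = f^{(I)}(m), \qquad f'_n(\xi|m) = f_n^{(I)}(\xi|m), \qquad \xi \in \{0, 1\}, \; m \in \mathcal{M}, \; n \in \mathcal{N}. \tag{20}$$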



Eqs. (17) - (20) correspond to one iteration of the EM algorithm but they 
are not equivalent to the original Eqs. (4) - (6). The increment of the log- 
likelihood function corresponding to the Eqs. (17) - (20) can be expressed 
in the form 



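$$L'_S - L_S = \frac{1}{|S|} \sum_{i=1}^{I} \sum_{x \in S_i} \sum_{m \in \mathcal{M}} f^{(i-1)}(m|x) \log \frac{f(m|x)}{f'(m|x)} + \sum_{m \in \mathcal{M}} f'(m) \log \frac{f'(m)}{f(m)} + \sum_{m \in \mathcal{M}} f'(m) \sum_{n \in \mathcal{N}} \sum_{\xi \in \{0, 1\}} f'_n(\xi|m) \log \frac{f'_n(\xi|m)}{f_n(\xi|m)}. \tag{21}$$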



Generally, expression (21) may be negative because of the first sum, but we 
had to use a very small data set (|<S| 10^) to demonstrate the nonmono- 

tonic behaviour of the sequential procedure (17) - (20). In computational ex- 
periments the intermediately updated parameters (20) essentially improved 
the initial iterations. 

It appears that a fixed partition (16) increases the increments of initial it- 
erations but disturbs the final convergence. In accordance with this idea 
we obtained the best results by making the partition (16) coarser after each 
substitution and by using unpartitioned set S in the final stages of compu- 
tation. The convergence curves obtained for \Si^i\ = + 500z are shown 

in Fig. 1 (upper five-tuple of curves).

Let us recall that by intermediate updating of parameters we obtain a procedure
which is no longer equivalent to the EM algorithm and therefore the
basic convergence properties are not guaranteed. Nevertheless, we can treat
the initial computation as a heuristic improvement of starting values, as
long as $|S_i| < |S|$. Further iterations using the unpartitioned set $S$
correspond to the standard EM algorithm again.





Fig. 3. A sequential modification of the EM algorithm (cf. Sec. 6) applied to an infinite
sequence of data sets (comparison with Fig. 2).



6 Concluding Remarks

It can be seen that Eqs. (17) - (20) represent a truly-sequential procedure
for an infinitely large set $S$ and for $I \to \infty$. However, the corresponding
computational experiments have shown a relatively slow convergence (cf.
Fig. 3). The value of the criterion is given by the formula

-[ i M 

T T M (22) 

l=l xeSi m=l 
l=l 

and the iteration steps are recomputed to the multiples of 10000 in analogy 
with Fig. 2, though an exact comparison with periodical sequences is not 
possible. 

Let us recall also (cf. Sec. 5) that, in general, the convergence properties of
the truly-sequential procedure are not guaranteed.

Abstract: Many cluster algorithms only allow the calculation of a partition
without the possibility of evaluating the stability or variability of the solution due
to the randomness of the sample. Resampling methods such as the bootstrap provide
a general framework within which one can analyse the stability of the results of a 
cluster analysis. We use it in the context of investigating psychological concepts 
based on variables of a questionnaire. We propose several measures to evaluate 
the variability of the clustering and exemplify the approach with a study on belief- 
attitudes of adults. 



1 Introduction 

In this paper we study a resampling approach for evaluating the stability of 
a partition derived from a cluster analysis. A main distinction can be made 
in cluster analysis between the problem of clustering a sample of objects 
for which several variables have been measured or of clustering the variables 
themselves. In both cases it is desirable not only to present a partition as the 
solution of some cluster algorithm but to accompany it with the assessment 
of its variability due to sampling the objects. In general, this is a difficult 
task because one has to describe the high dimensional distribution of the 
partitions. In the first case there are model based approaches, e.g. one
assumes that the data follow some class-specific distributions like normal
distributions. A variability assessment can be done by testing the existence
of a clustering structure (Bock (1996)). If the form of the underlying distributions
is questionable or the sampling distribution of the descriptors of
the partitions is intractable, resampling methods form a general framework
to tackle the problem. Jhun (1990) used the bootstrap in the case of k-means
clustering. Our clustering task occurs in psychology and requires partition- 
ing the variables but not the objects. Therefore the aforementioned model 
based approaches cannot be applied. The variables are also called questions 
or items and the objects are the persons who answered the questionnaire. 
In psychology one represents a psychological concept by a set of variables 
that allow the measurement of the concept. In the following we will use the 
term concept also for the corresponding set of variables. Empirical studies 
serve the evaluation of the stability of such concepts, where it is especially
important to investigate the results with respect to their sensitivity to the
sampling process of the objects. For any cluster algorithm one may study
this variability by analysing the distribution of the partitions calculated from
bootstrapped samples of the original data. We will use easily depictable measures
of this distribution to exhibit the variability of the clustering. Note
that we are not concerned with the general evaluation of cluster algorithms
with respect to their stability or their ability to recover an underlying structure
(cf. Baker (1974), Milligan (1980)) but with the stability of the
clustering of some data considered variable with respect to sampling.



2 Method 

The sample distribution $\Lambda$ of the partitions constructed with a cluster algorithm
is approximated by the nonparametric bootstrap (Efron (1979)). Let
$\{X_i = (X_{i1}, \dots, X_{ip}),\ i = 1, \dots, N\}$ denote a sample of $N$ observations, each
observation consisting of the measurements of $p$ variables. A bootstrap sample
$(X_i^*)_{i=1,\dots,N}$ is constructed by drawing with replacement $N$ times from
the original sample, giving equal weight $1/N$ to each observation. Actually,
we draw $B$ bootstrap samples $(X_i^{*b})_{i=1,\dots,N}$, $b = 1, \dots, B$, compute for
each sample a partition $A^{*b}$ and consider the empirical distribution $\Lambda^*$ of
these $B$ partitions as an approximation for $\Lambda$.

We now describe several approaches to analyse this distribution in order to
study the variability of the estimation of a partition. First, we examine the
distribution with respect to the existence of several modes. Due to the high
dimensionality of the space of partitions the frequency of a specific partition
will be too small to be taken as an estimate for the probability of
the partition. We therefore estimate the mass of the neighbourhood of a
partition, where the neighbourhood has some diameter $\Delta$. We take the fraction
of the $B$ partitions that lie in this neighbourhood as an estimate. A partition
is considered to be a potential mode if in its neighbourhood there is no
partition with a larger estimated mass. The distance measure $d(A, A')$ between
two partitions is taken as 1 minus the adjusted Rand index (Rand (1971)) between
these partitions. The Rand index is defined as the ratio of the number of
concordant and discordant pairs to all possible pairs. Two variables are called
concordant if they cluster in both partitions, and they are called discordant if
they are in different clusters in both partitions. For comparability the adjustment
incorporates a random assignment of the variables to the partitions (Hubert
and Arabie (1985)).
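
The distance and the neighbourhood masses described above can be sketched as follows. This is an illustration under assumptions, not the authors' program: partitions are represented as arrays of cluster labels (one label per variable), the adjustment follows the usual Hubert and Arabie formula, and the neighbourhood of a partition is taken to be all sampled partitions within distance `delta` of it. All function names are hypothetical.

```python
import numpy as np

def pairs(x):
    """Number of unordered pairs that can be formed from x elements."""
    return x * (x - 1) / 2.0

def adjusted_rand_distance(a, b):
    """1 minus the adjusted Rand index of two label vectors a and b."""
    _, ai = np.unique(np.asarray(a), return_inverse=True)
    _, bi = np.unique(np.asarray(b), return_inverse=True)
    table = np.zeros((ai.max() + 1, bi.max() + 1))
    np.add.at(table, (ai, bi), 1)                 # contingency table of the two partitions
    together = pairs(table).sum()                 # pairs clustered together in both
    sum_a = pairs(table.sum(axis=1)).sum()
    sum_b = pairs(table.sum(axis=0)).sum()
    expected = sum_a * sum_b / pairs(len(ai))     # value expected under random assignment
    maximum = 0.5 * (sum_a + sum_b)
    if maximum == expected:                       # degenerate case, e.g. two one-cluster partitions
        return 0.0
    return 1.0 - (together - expected) / (maximum - expected)

def neighbourhood_masses(partitions, delta):
    """Distance matrix, estimated neighbourhood mass and potential modes."""
    B = len(partitions)
    D = np.array([[adjusted_rand_distance(p, q) for q in partitions] for p in partitions])
    mass = (D <= delta).mean(axis=1)              # fraction of the B partitions nearby
    # A potential mode has no partition with larger mass in its neighbourhood.
    is_mode = np.array([mass[b] >= mass[D[b] <= delta].max() for b in range(B)])
    return D, mass, is_mode

# Three toy partitions of four variables; the first two differ only in labelling.
partitions = np.array([[0, 0, 1, 1], [1, 1, 0, 0], [0, 1, 0, 1]])
D, mass, is_mode = neighbourhood_masses(partitions, delta=0.04)
```

Since the adjusted index can become negative for partitions that agree less than expected by chance, the distance can exceed 1.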

We now turn to a measure $m_k$ that describes the similarity of the clusters
which have the $k$-th variable in common across the sampled partitions. The
similarity of two clusters $C$ and $C'$ containing the $k$-th variable is measured
by their number of common variables relative to all the variables of the two
clusters. We consider the mean taken over all possible pairs of partitions,

$$m_k = \frac{1}{B(B-1)} \sum_{b \neq b'} \frac{\#(C_k^{b} \cap C_k^{b'})}{\#(C_k^{b} \cup C_k^{b'})},$$

where $C_k^{b}$ denotes the cluster of the partition $A^{*b}$ that contains the $k$-th variable.

In order to evaluate the relative variability of clustering the variables we also
consider the probability of a set $S$ of variables to be clustered. If $\mathcal{A}$ denotes
the set that contains the clusters of a partition and all their subsets, we can
write this probability as

$$P(S) := P(\{\mathcal{A} \mid S \in \mathcal{A}\}), \tag{1}$$

which will be estimated by $\frac{1}{B} \#\{b \mid S \in \mathcal{A}^{*b}\}$. The consideration of all
possible pairs of variables allows a simultaneous graphical representation of
their corresponding clustering probabilities by a two dimensional grey-scale
matrix. For sets with a larger number of variables there is no such simple
representation. In order to get insight into the variability of larger sets we
start with one interesting set of variables, e.g. a cluster found in the partition
of the original data, and successively delete the variable that maximally increases
the probability for the remaining set of variables. Additionally, one
could ask for the smallest set of variables that has a prespecified probability,
say 95%. But at the moment we know of no efficient algorithm to solve
that problem. In analogy to (1) one can define the probability that a set of
(preferably disjoint) sets of variables $S_1, \dots, S_k$ is consistent with partitions,
i.e. $P(\{\mathcal{A} \mid S_1 \in \mathcal{A}, \dots, S_k \in \mathcal{A}\})$. We concentrate on a reasonable choice
from all possible combinations of sets by first regarding the partition of the
original data. Starting from this partition we construct the other sets of sets
of variables by successively deleting the variable for which the corresponding
probability maximally increases.

Finally, the average pairwise distance of all bootstrapped partitions will be
taken as a global measure for variability. Alternatively one could take the
average distance to some reference partition, for example the partition of the
original sample.

We use a cluster algorithm proposed by one of the authors (Schweizer (1991))
which is agglomerative with reallocation steps at each agglomerative level.
The similarity measure between two sets of variables is equal to the absolute
value $|r|$ of a hypothetical homogeneous correlation $r$ between all the
variables, such that the observed correlation between the means of the two
sets equals the one based on the homogeneity assumption. Schweizer et al.
(1996) could show in a Monte Carlo study that this cluster algorithm is able
to recover a partition over a wide range of random disturbances.



3 Example

We illustrate the methods with data based on a study (Mischo (1996)) concerned
with general belief-attitudes of 385 adults. We analysed a subset of
the variables of a questionnaire comprising five concepts, religion (variable
1 to 8), astrology (variable 9 to 17), magical-irrational thinking (variable 18
to 35), extraversion (variable 36 to 49) and emotional lability (variable 50 to
63). We refer to the partitioning of the data into these five sets as the conceptual
partition. The last two concepts are well established and taken from
the 'Freiburger Persönlichkeitsinventar (FPI)' (Fahrenberg et al. (1994)).
The other three concepts are less well established and based on a former
study (Mischo (1991)). One aim of the analysis is to exhibit the relative
stability of the five concepts with respect to the new data collected. Since
the variables are measured on an ordinal scale with two or four levels, Spearman's
correlation coefficient was used as the similarity measure between the
variables. We analyse mainly the five cluster solutions. The analysis is based
on a sample of partitions calculated for $B = 1000$ bootstrap samples of the
original data.

The analysis of modes of the distribution of these partitions showed two
distinct modes with respect to a neighbourhood of $\Delta = 0.04$. The partition
with the highest mass ($p = 0.6$) of its neighbourhood is equal to the conceptual
partition of the variables into the five concepts. The second mode
($p = 0.4$) is a partition derived from the first one by merging the variables
from religion and magical-irrational thinking, the four variables 9, 39, 44 and
48 now forming one cluster.

Figure 1: Average similarity of clusters across partitions

In Fig. 1 we depict the average similarity of the clusters across the partitions
with respect to each variable. Especially the variables of the first three
concepts appear to be quite homogeneous with respect to this measure. The
higher levels for the variables of the magical-irrational thinking in comparison
to those of astrology are mainly due to the bimodality of the distribution
of the partitions. Obviously, the clusters for variables 9, 39 and 48 are most
inhomogeneous across the partitions. One disadvantage of this representation
is that it gives no indication of those variables with which a variable is
most likely to be clustered.
This can be seen from Fig. 2 where the estimated probabilities for pairwise
clustering of two variables are given. The variables have been ordered with
respect to the first eigenvector of a correspondence analysis of the matrix of
the estimated probabilities in order to achieve a better graphical representation.
It conveys the overall impression that most of the variables which
belong to the same cluster in the conceptual partition are pairwise found in
the same cluster with a probability of more than 90%. Between most of the
variables of astrology and of the magical-irrational thinking the probability
of pairwise clustering is larger than 20%. This tendency of clustering again
reflects the bimodality of the distribution of the partitions. The comparison
of the variables 39, 44 and 48 with the other variables of the concept of extraversion
shows that the estimated pairwise probabilities are smaller than
80%. In contrast, the pairwise probabilities between the three variables
themselves exceed this value. Additionally, if one looks at the six cluster
solution, these three variables form a cluster and the variables of astrology
and magical-irrational thinking form two distinct clusters. Solutions with
more than six clusters generally show decreasing probabilities for pairwise
clustering and no new groups of variables occur that cluster pairwise with
high probability. One may thus argue that the conceptual partition into
five clusters is appropriate for the data if one takes into account that three
of the variables form a special group within the concept of extraversion.

Figure 2: Estimated probabilities of pairwise clustering of variables
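
The quantities behind Fig. 1 and Fig. 2 can be computed directly from the sampled partitions. The following sketch assumes that the $B$ partitions are stored as rows of an integer array, one cluster label per variable; the function names and the toy partitions are hypothetical and do not stem from the study.

```python
import numpy as np
from itertools import combinations

def cluster_of(labels, k):
    """Set of variables sharing a cluster with variable k in one partition."""
    return frozenset(np.flatnonzero(labels == labels[k]))

def average_similarity(partitions, k):
    """Mean over all pairs of partitions of |C n C'| / |C u C'| for the clusters containing k."""
    clusters = [cluster_of(p, k) for p in partitions]
    sims = [len(c & d) / len(c | d) for c, d in combinations(clusters, 2)]
    return float(np.mean(sims))

def clustering_probability(partitions, S):
    """Fraction of partitions in which all variables of S lie in one cluster."""
    S = list(S)
    return float(np.mean([len(set(p[S])) == 1 for p in partitions]))

def pairwise_probabilities(partitions):
    """Matrix of estimated probabilities that two variables are clustered together."""
    partitions = np.asarray(partitions)
    same = partitions[:, :, None] == partitions[:, None, :]
    return same.mean(axis=0)

# Four toy partitions of four variables (B = 4, p = 4).
partitions = np.array([[0, 0, 1, 1],
                       [0, 0, 1, 1],
                       [0, 1, 1, 1],
                       [0, 0, 0, 1]])
m0 = average_similarity(partitions, 0)            # similarity measure for variable 0
p01 = clustering_probability(partitions, [0, 1])  # probability that variables 0 and 1 cluster
P = pairwise_probabilities(partitions)            # basis for a grey-scale matrix
```

Averaging over unordered pairs of partitions, as done here, gives the same value as averaging over all ordered pairs with different indices because the similarity is symmetric.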

In Fig. 3 we focus on the stability of the concept of extraversion. In about 
40% of the partitions the variables 36 to 49 are found in a common cluster. 
After removing variable 39, the probability for the remaining set of variables 
is maximally increased to a value of about 58%. Further removing variables 
48, 44 and 36 we get a set of variables that cluster in 95% of the sampled 
partitions. This strengthens the observation that the concept is quite stable 
apart from three or four variables. 

In Fig. 4 we looked at the consistency of the conceptual partition with the 
sampled ones. 21% of these partitions coincide with the conceptual partition. 
If we disregard variable 35, 30% of the sampled partitions are consistent with 
the remaining sets of variables. At the level of 95% most of the variables of 
religion and of the magical-irrational thinking are still present. The average 
pairwise distance of the sampled partitions has a value of 0.2. It should be 
noted that a histogram of the pairwise distances shows two distinct peaks. 
This again reflects the existence of two modes for the five cluster solution.



4 Discussion 

We demonstrated the applicability of the bootstrap method as an exploratory 
tool for investigating the sample variability of a clustering solution for some 
cluster algorithm. From the measures presented the probabilities that de- 
scribe the pairwise clustering of variables seem to be most revealing, since 
they allow a simultaneous representation of all the variables and their mutual 
likelihood to cluster. Emphasising the exploratory aspect, we did not analyse
conditions under which the bootstrap distribution of the partitions converges to
their sample distribution. For clustering based on the sample covariance matrix
the convergence result derived by Beran and Srivastava (1985) is of key
importance. Some caution will be in order if the observations are extremely
inhomogeneous, but we think that the method is generally applicable to
avoid overinterpreting the stability of a given cluster solution.

Abstract: Three models for linear regression clustering are given, and corresponding
methods for classification and parameter estimation are developed and
discussed: The mixture model with fixed regressors (ML-estimation), the fixed 
partition model with fixed regressors (ML-estimation), and the mixture model 
with random regressors (Fixed Point Clustering). The number of clusters is 
treated as unknown. The approaches are compared via an application to Fisher’s 
Iris data. By the way, a broadly ignored feature of these data is discovered. 



1 Introduction 

Cluster analysis problems based on stochastic models can be divided into 
two classes: 

1. A cluster is considered as a subset of the data points, which can be 
modeled adequately by a distribution from a class of cluster refer- 
ence distributions (c.r.d.). These distributions are chosen to reflect 
the meaning of homogeneity with respect to the particular data analysis
problem. Therefore c.r.d. are often unimodal. If the class of c.r.d. is 
parametric, then one is interested in classification of the data points 
and parameter estimation within each cluster. 

2. A cluster is considered as an area of high density of the distribution of 
the whole dataset. No distributional assumption is made for the single 
clusters. 

Clusterwise linear regression is a problem of the first kind since the points of
each cluster are considered to be generated according to some linear regression
relation, i.e. one imagines a separate model for each cluster. The class
of c.r.d. for the regression clustering problem contains distributions of the
following kind: Consider a dataset $Z = (x_i', y_i)_{i \in I}$, $x_i \in \{1\} \times \mathbb{R}^p$, $y_i \in \mathbb{R}$, $I$
being some index set. The conditional distribution $\mathcal{L}(y_i|x_i)$ is defined by

$$y_i = x_i'\beta + u_i, \qquad \mathcal{L}(u_i) = \mathcal{N}(0, \sigma^2), \qquad (\beta, \sigma^2) \in \mathbb{R}^{p+1} \times \mathbb{R}^+.$$

The first component of $\beta$ denotes the intercept. The $u_i$, $i \in I$, are considered
to be stochastically independent. The $x_i$ are called regressors in the
following. They can be fixed or random with $\mathcal{L}(x_i) = G$ from some class of
distributions $\mathcal{G}$. In the latter case the regressors are assumed to be i.i.d. and
independent of $(u_i)_{i \in I}$. $F_{G,\beta,\sigma^2}$ then denotes the joint distribution of $(x_i, y_i)$.
In our setup, all parameters are considered as unknown. 

The models will be divided into fixed and random regressor models, and 
into mixture and fixed partition models. Mixture models treat the cluster 
membership of a point as random, fixed partition models contain parameters 
for the cluster membership of each point. A fixed partition model with 
random regressors will not be given because this does not lead to an easy 
clustering method. The purpose of the model based approach presented 
here is not to describe the mechanism generating the data, but to find an 
adequate description of the data themselves. Thus, all models can be applied 
to the same data. In particular, the question is ignored if the regressors were 
really fixed or random. 

The literature on clusterwise linear regression either treats the mixture 
model with fixed regressors (e.g. Quandt and Ramsey (1978), for general p 
and number of clusters DeSarbo and Cron (1988)) or discusses algorithms 
for a least squares solution (e.g. Bock (1969), Spaeth (1979)) which is re- 
lated to the fixed regressors fixed partition model presented here in the case 
of equal error variances for each cluster. This paper is based on the unpub- 
lished dissertation Hennig (1997b) where simulation results and proofs are 
given in full detail. 



2 Fixed Regressors, Mixture Model 

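Let $I$ be an index set, usually $I = \{1, \dots, n\}$. With a given regressor design
$(x_i)_{i \in I} \in (\{1\} \times \mathbb{R}^p)^I$ the fixed regressors mixture model (FRM) is
defined by

$$\mathcal{L}((y_i)_{i \in I}) = \bigotimes_{i \in I} \sum_{j=1}^{s} \epsilon_j \, \mathcal{N}(x_i'\beta_j, \sigma_j^2), \qquad \sum_{j=1}^{s} \epsilon_j = 1, \quad \epsilon_j > 0, \quad j = 1, \dots, s.$$

That is, $s$ denotes the number of clusters and $\epsilon_j$ denotes the proportion of
cluster $j$. The log-likelihood function

$$\ln L_n(s, (\epsilon_j, \beta_j, \sigma_j^2)_{j=1,\dots,s}, Z) = \sum_{i=1}^{n} \ln \left( \sum_{j=1}^{s} \frac{\epsilon_j}{\sqrt{2\pi\sigma_j^2}} \exp \left( \frac{-(y_i - x_i'\beta_j)^2}{2\sigma_j^2} \right) \right)$$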




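can be locally maximized for given $s$ via the EM algorithm described in
DeSarbo and Cron (1988). This works only subject to $\sigma_j^2 > c \ \forall j$ with
some lower bound $c > 0$ (e.g. $c = 0.001$) since otherwise $\ln L_n$ would be
unbounded. After having performed the algorithm, point $i$ can be classified
to cluster $\hat\gamma(i) \in \{1, \dots, s\}$ according to

$$\hat\gamma(i) = \arg\max_j \hat\epsilon_{ij}, \qquad \hat\epsilon_{ij} = \frac{\hat\epsilon_j \, \varphi_{(x_i'\hat\beta_j, \hat\sigma_j^2)}(y_i)}{\sum_{l=1}^{s} \hat\epsilon_l \, \varphi_{(x_i'\hat\beta_l, \hat\sigma_l^2)}(y_i)},$$

where $\varphi_{(\mu, \sigma^2)}$ denotes the density of $\mathcal{N}(\mu, \sigma^2)$.
$\hat\epsilon_{ij}$ denotes the estimated a posteriori probability for point $i$ to be generated
by mixture component $j$.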
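
The classification step can be sketched as follows, assuming that parameter estimates are already available, e.g. from an EM run. The function name, the toy data and the parameter values below are hypothetical; the design matrix carries a leading column of ones for the intercept.

```python
import numpy as np

def classify_frm(X, y, eps, beta, sigma2):
    """Posterior probabilities and cluster assignments in the mixture model.

    X: (n, p+1) regressors with leading 1, y: (n,),
    eps: (s,) proportions, beta: (s, p+1) coefficients, sigma2: (s,) variances.
    """
    resid = y[:, None] - X @ beta.T                               # y_i - x_i' beta_j
    log_dens = -0.5 * (np.log(2.0 * np.pi * sigma2) + resid**2 / sigma2)
    log_post = np.log(eps) + log_dens
    log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)
    post = np.exp(log_post)                                       # a posteriori probabilities
    return post, post.argmax(axis=1)                              # assignment by arg max over j

# Toy example: two points on y = 1 + 2x and two points on y = 10 - x.
x = np.array([0.0, 1.0, 4.0, 5.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([1.0, 3.0, 6.0, 5.0])
eps = np.array([0.5, 0.5])
beta = np.array([[1.0, 2.0], [10.0, -1.0]])
sigma2 = np.array([0.25, 0.25])

post, gamma = classify_frm(X, y, eps, beta, sigma2)
```

Working with logarithms avoids numerical underflow when a point lies far from all regression lines.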

The consistency proofs for FRM-ML estimation (Kiefer (1978), DeSarbo and 
Cron (1988)) suffer from not taking possible identifiability problems (Hennig 
(1996)) into account. 

DeSarbo and Cron (1988) suggest Akaike's Information Criterion (AIC) for
the estimation of $s$:

$$\hat s := \arg\max_s \ \ln L_n(s) - k(s), \qquad k(s) = (p + 3)s - 1.$$

$k(s)$ denotes the number of free parameters to estimate for the cluster number
$s$ and $\ln L_n(s)$ is the estimated maximum log-likelihood. Their simulations
do not treat the performance of this proposal. The simulations of
Hennig (1997b) show the tendency of the AIC to overestimate a small number
of clusters. Schwarz' Criterion (SC) gives smaller estimates of $s$ for
$n > e^2$ and seems to work better:

$$\hat s := \arg\max_s \ \ln L_n(s) - \frac{\ln n}{2} k(s).$$



The discussion of the Iris data example in section 5 illustrates this perfor- 
mance. Up to now there are no theoretical results on the performance of the 
AIC and SC for linear regression mixtures. 
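
As a small illustration of how the two criteria differ, the following sketch evaluates both penalties for a table of maximized log-likelihoods. The function names and the numbers in the usage example are hypothetical; in practice the log-likelihood values would come from an EM fit for each candidate $s$.

```python
import math

def n_params_frm(p, s):
    """Number of free FRM parameters k(s) = (p + 3) s - 1."""
    return (p + 3) * s - 1

def select_s(loglik_by_s, n, p, criterion="SC"):
    """Pick s maximizing ln L_n(s) minus the AIC or SC penalty."""
    factor = 1.0 if criterion == "AIC" else 0.5 * math.log(n)
    scores = {s: ll - factor * n_params_frm(p, s)
              for s, ll in loglik_by_s.items()}
    return max(scores, key=scores.get), scores

# Hypothetical maximized log-likelihoods for s = 1, 2, 3 with n = 150, p = 1
loglik = {1: -300.0, 2: -250.0, 3: -245.0}
print(select_s(loglik, n=150, p=1, criterion="AIC")[0])
print(select_s(loglik, n=150, p=1, criterion="SC")[0])
```

With these made-up values the AIC prefers the larger model while the SC, whose penalty per parameter is $\ln(150)/2 \approx 2.5$ instead of 1, prefers the smaller one.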

Some alternative proposals for parameter estimation within this model were made (e.g. Quandt and Ramsey (1978)), but they lead to greater numerical difficulties and were investigated only for $s = 2$, $p = 1$.

Figure 1: Assignment independence - assignment dependence

The implicit assumption of assignment independence is a disadvantage of the FRM. That is, the clusters keep the same proportions $\epsilon_j$, $j = 1, \ldots, s$, for every fixed regressor $x_i$ (see figure 1). The probability of a point $(x_i, y_i)$ to be generated by cluster $j$ is independent of $x_i$ and $i$. This is not generally true. For example, in a change point setup, the cluster membership is considered as determined by $x_i$ or $i$. Methods concerning this particular assumption can be found e.g. in Krishnaiah and Miao (1988). Also for the Iris data in section 5, assignment independence seems not to be fulfilled.



3 Fixed Regressors, Fixed Partition Model 

In the fixed partition approach, the cluster membership of each point $i$ is indicated by a parameter $\gamma(i)$. Thus, general kinds of assignment dependency can be modeled. The fixed regressors fixed partition model (FRFP) is given by

$$\mathcal{L}\left((y_i)_{i \in I}\right) = \bigotimes_{i \in I} \mathcal{N}\left(x_i'\beta_{\gamma(i)}, \sigma^2_{\gamma(i)}\right), \qquad \gamma: I \to \{1, \ldots, s\},$$

with $(x_i)_{i \in I} \in (\{1\} \times \mathbb{R}^p)^n$ again given fixed. Under known $s$, ML-estimation is also possible within this model. The log-likelihood function is given by

$$\ln L_n\left(s, \gamma, (\beta_j, \sigma_j^2)_{j=1,\ldots,s}, Z\right) = -\frac{1}{2} \sum_{j=1}^{s} \sum_{\gamma(i)=j} \left[ \ln 2\pi + \ln \sigma_j^2 + \frac{(y_i - \beta_j' x_i)^2}{\sigma_j^2} \right]. \qquad (1)$$

For given $(\beta_j, \sigma_j^2)_{j=1,\ldots,s}$, (1) is maximized according to

$$\gamma(i) = \arg\min_j \left[ \ln \sigma_j^2 + \frac{(y_i - \beta_j' x_i)^2}{\sigma_j^2} \right]. \qquad (2)$$

For given $\gamma$, (1) is the sum of the usual log-likelihood functions for homogeneous linear regressions within each cluster. Therefore, it is maximized by the LS-estimator $\hat\beta_j$ from the points $(x_i, y_i)$ with $\gamma(i) = j$ and

$$\hat\sigma_j^2 = \frac{\sum_{\gamma(i)=j} (y_i - \hat\beta_j' x_i)^2}{n_j}, \qquad n_j = |\{i : \gamma(i) = j\}|, \qquad j = 1, \ldots, s. \qquad (3)$$

That is, $\ln L_n$ is monotonically increased if the steps (2) and (3) are carried out alternately. This algorithm leads to a local maximum in finitely many steps since there are only finitely many choices for $\gamma$. In my experience, this is noticeably the fastest algorithm discussed in this paper. Under $\sigma_1^2 = \ldots = \sigma_s^2$ the procedure is equivalent to the least squares algorithm of Bock (1969).

There is some literature that compares mixture and fixed partition approaches applied to location-scale and especially Gaussian distributed clusters (e.g. Bryant and Williamson (1986)). Analogously to the location-scale case it can be shown that FRFP-ML leads to inconsistent parameter estimators. This does not matter in practice if the clusters are well separated, but causes serious problems otherwise. Like FRM-ML, FRFP-ML needs some lower bound on the error variance parameters since otherwise $\ln L_n$ would be unbounded.
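
The alternation of the assignment step (2) and the estimation step (3) can be sketched as below. The function name `frfp_ml`, the handling of clusters with too few points (their parameters are simply kept), and the variance bound `c` are assumptions made for this illustration; class labels are 0-based.

```python
import numpy as np

def frfp_ml(X, y, s, gamma0, c=1e-3, max_iter=100):
    """Alternate steps (3) and (2) of the FRFP model from a start partition."""
    gamma = np.asarray(gamma0).copy()
    n, q = X.shape
    beta = np.zeros((s, q))
    var = np.ones(s)
    for _ in range(max_iter):
        # Step (3): least squares fit and ML variance within each cluster
        for j in range(s):
            idx = gamma == j
            if idx.sum() >= q:  # otherwise keep old parameters (assumption)
                beta[j] = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
                var[j] = max(np.mean((y[idx] - X[idx] @ beta[j]) ** 2), c)
        # Step (2): reassign every point to its best-fitting cluster
        resid2 = (y[:, None] - X @ beta.T) ** 2
        new_gamma = np.argmin(np.log(var) + resid2 / var, axis=1)
        if np.array_equal(new_gamma, gamma):
            break  # partition is a fixed point of the alternation
        gamma = new_gamma
    own = resid2[np.arange(n), gamma]
    loglik = -0.5 * np.sum(np.log(2 * np.pi) + np.log(var[gamma])
                           + own / var[gamma])
    return gamma, beta, var, loglik
```

As with the EM sketch, the result depends on the starting partition `gamma0`, so several starts would be compared by their log-likelihood (1).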

The approaches for the estimation of $s$ discussed in section 2 are not reasonable here because the number of parameters $\gamma(i)$ increases with $n$ and their value range increases with $s$. The following modification of the SC worked very well in simulations:

$$\hat{s} := \arg\max_s \left[ \ln L_n(s) - \frac{\ln n}{2} k(s) - 0.7\,sn \right], \qquad k(s) = (p+2)s, \qquad (4)$$

$k(s)$ denoting the number of regression and scale parameters.



4 Random Regressors, Mixture Model 

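Random regressors have the advantage that the observations can be treated as i.i.d. The random regressors mixture model (RRM) has the following form: $(x_i', y_i) \in \{1\} \times \mathbb{R}^p \times \mathbb{R}$, $i \in I$, are distributed i.i.d. according to

$$\sum_{j=1}^{s} \epsilon_j F_{(G_j, \beta_j, \sigma_j^2)}, \qquad \sum_{j=1}^{s} \epsilon_j = 1, \qquad G_1, \ldots, G_s \in \mathcal{G},$$

that is, $\mathcal{L}(x) = G_j$ within cluster $j$. Suitable choices for $G_j$, $j = 1, \ldots, s$, enable us to model every kind of assignment (in-)dependence. Usually the $G_j$ are not of interest, but unknown. For performing ML-estimation, there needs to be a parametric specification of $\mathcal{G}$. This will not be discussed here. A more general approach is presented instead. The RRM is a special case of the contamination model (CM) (choose $F^* = \frac{1}{\epsilon} \sum_{j=2}^{s} \epsilon_j F_{(G_j, \beta_j, \sigma_j^2)}$, $(G, \beta, \sigma^2) = (G_1, \beta_1, \sigma_1^2)$, $\epsilon = 1 - \epsilon_1$ below):

$$\mathcal{L}(x, y) = (1 - \epsilon) F_{(G, \beta, \sigma^2)} + \epsilon F^*, \qquad 0 < \epsilon < 1, \quad G \in \mathcal{G}. \qquad (5)$$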

There is a basic difference between the CM and the former models. The parameters $(G, \beta, \sigma^2)$ are clearly not unique in (5) since they can correspond to $(G_j, \beta_j, \sigma_j^2)$ of the RRM for each $j$. Further, if $F^*$ is not assumed to be of a mixture type, the CM allows for outliers, i.e. points in the data which do not belong to any regression population. In robust statistics, the CM with $\epsilon < \frac{1}{2}$ is a standard tool to describe the occurrence of outliers.







A method to analyze the CM should find possible choices for $(\beta, \sigma^2)$ ($G$ is treated as nuisance) and therefore needs no specification of some number of clusters.

This goal can be achieved by means of Fixed Point Clustering. The idea 
of this approach is that a data subset, which contains no outliers, can be 
viewed as homogeneous. If at the same time all other points of the dataset 
are outliers with respect to the subset, then the subset is separated from the 
rest and can be considered as a cluster. 

For an indicator vector $g \in \{0, 1\}^n$ define $Z(g) := (x_i', y_i)_{g_i = 1}$.

Definition: $Z(g)$ is called Fixed Point Cluster (FPC) w.r.t. $Z$, iff $g$ is a fixed point of

$$f: \{0,1\}^n \to \{0,1\}^n, \qquad f_i(g) = 1\left[ \left(y_i - x_i'\hat\beta(Z(g))\right)^2 \leq c\,\hat\sigma^2(Z(g)) \right]$$

with some prechosen constant $c$ (e.g. $c = 10$). $\hat\beta(Z(g))$ and $\hat\sigma^2(Z(g))$ are regression parameter and error variance estimators based only on the data subset $Z(g)$, e.g. the ML-estimators from (3).

The function $f$ is an inverted outlier identifier (0 for outliers) based on the random regressor linear regression model. That is, a point is considered as an outlier w.r.t. $F_{(G, \beta, \sigma^2)}$ if it falls into the outlier region $\{(y - x'\beta)^2 > c\sigma^2\}$ (see Davies and Gather (1993) for the concept of model based outlier regions). Therefore an FPC $Z(g)$ is exactly the set of non-outliers in $Z$ w.r.t. $Z(g)$ and can be interpreted as the set of “ordinary observations” generated by some member of the c.r.d.-family.

The method is similar to some procedures for robust regression where the goal is to find a solution of $\sum_{i} \rho(y_i - x_i'\beta) = \min_\beta$. The function $\rho$ also provides some kind of outlier identification. Local minima could be interpreted as parameters for clusters (Morgenthaler (1990)), but the choice of $\rho$ is not clear and a robust estimator would depend on at least half of the data. This is not adequate for cluster analysis.

FPCs can be computed with the usual fixed point algorithm ($g^{k+1} = f(g^k)$), which converges in finitely many steps (proven in Hennig (1997b)). In order to find all relevant clusters, the algorithm must be started many times with various starting vectors $g$. A complete search is numerically impossible. However, this also holds for the other two methods unless one is satisfied with a local maximum of unknown quality of the log-likelihood function. The FPC methodology does not force a partition of the dataset. Non-disjoint FPCs are possible, as are points which do not belong to any FPC. Accordingly, FPCs are an exploratory tool rather than a parameter estimation procedure in the case of a valid partition or mixture model.
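
The fixed point iteration $g^{k+1} = f(g^k)$ can be sketched as follows, using least squares and the ML variance from (3) as the estimators on the current subset. The function names, the iteration cap, and the rule of giving up when the subset becomes too small are assumptions of this illustration, not part of the definition.

```python
import numpy as np

def fpc_step(X, y, g, c=10.0):
    """One application of f: refit on the subset g, then flag non-outliers."""
    beta = np.linalg.lstsq(X[g], y[g], rcond=None)[0]
    resid2 = (y - X @ beta) ** 2
    var = np.mean(resid2[g])  # ML variance estimate on the subset
    return resid2 <= c * var

def fixed_point_cluster(X, y, g0, c=10.0, max_iter=1000):
    """Iterate f from the boolean start vector g0 until g = f(g)."""
    g = np.asarray(g0, dtype=bool)
    for _ in range(max_iter):
        if g.sum() <= X.shape[1]:
            return None  # too few points to estimate (assumption)
        g_new = fpc_step(X, y, g, c)
        if np.array_equal(g_new, g):
            return g
        g = g_new
    return None
```

To search for several FPCs, `fixed_point_cluster` would be called with many different start vectors `g0`, and the distinct fixed points collected.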

The application of FPC analysis to more general situations is discussed in Hennig (1997a), Hennig (1998).



5 Iris Data Example And Comparison

Fisher’s Iris data (Fisher (1936)) consists of four measurements of three 
species of Iris plants. The measurements are sepal width (SW), sepal length 
(SL), petal width (PW) and petal length (PL). The species are Iris setosa 
(empty circles in figure 2a), Iris virginica (filled circles) and Iris versicolor 
(empty squares). Each species is represented by 50 points. Originally, the 
classification of the plants was not a regression problem. The dataset is used for illustrative purposes here. A more “real world” but less illustrative example can be found in Hennig (1998). Only the variables SW and PW are considered. PW is modeled as dependent on SW. The distinction between “regressor” and “dependent variable” is artificial. The methods use no information about the real partition. By eye, the setosa plants are clearly separated from the other two species, while virginica and versicolor overlap. A linear regression relation between SW and PW seems to be appropriate within each of the species.

Figure 2: Iris data: a) original species - b) FRM-ML clusters with SC

Figure 3: a) FRFP-ML clusters - b) Fixed Point Clusters

Using the SC for estimating the number of clusters, FRM-ML-estimation finds the four clusters shown in figure 2b. Three clusters correspond to the three species. FRM-ML is the only method which provides a rough distinction between the virginica and versicolor plants. The fourth cluster (crosses in figure 2b) is some kind of “garbage cluster”. It contains some points which are not fitted well enough by one of the other three regression equations. Note that the deviation from assignment independence of the four cluster solution seems to be lower than that of the original partition of the species. The AIC for estimating the number of clusters leads to five clusters by removing further points from the three large clusters and building a second garbage cluster.

By application of (4), the number of clusters is estimated as 2. Figure 3a shows the ML-classification using the FRFP. It corresponds to the most natural¹ eye-fit. The well separated setosa plants form a cluster, the other two species are put together.

With 150 randomly chosen starting vectors, four FPCs are found. The first 
contains the whole dataset. This happens usually and is an artifact of the 
method. One has to know that to interpret the results adequately. The 
second and third cluster correspond to the setosa plants and the rest of the 
data, respectively. The point labelled by a cross falls in the intersection 
of both clusters and is therefore indicated as special. The fourth cluster is 
labelled by empty squares and consists of 29 points from the setosa cluster, 
which lie exactly on a line because of the rounding of the data. The other 
methods are not able to find this constellation because of the lower bounds 
on the error variances. 

After having noticed this result, one realizes that there are other groups of points which lie exactly on a line, and which are not found by the random search of Fixed Point Clustering since they are too small. The fourth FPC contains more than half of the setosa species² and is therefore a remarkable feature of the Iris data.

The results from the Iris data highlight the special characteristics of the 
three methods. The simulation study of Hennig (1997b) leads to similar 
conclusions. 

FRM-ML-estimation is the best procedure if assignment independence holds and if the clusters are not well separated. For the Iris data, it can discriminate between virginica and versicolor. The stress is on regression and error variance parameter estimation.

FRFP-ML-estimation is the best procedure under most kinds of assignment dependence to find well separated clusters if there is a clear partition of the dataset. For the Iris data, the procedure finds the visually clearest constellation. The stress is on point classification.

Fixed Point Clustering is the best procedure to find well separated clus- 
ters if outliers or identifiability problems (Hennig (1996)) exist. Its 
stress is on exploratory purposes. By means of Fixed Point Clustering, 
the discovery that a large part of the setosa cluster lies exactly on a 
line was made. 



¹ It is not clear what “most natural” means, but this is the impression of the author.
² One cannot see 29 squares because some of the points are identical.

Abstract: Statistical clustering of multivariate observations is considered under three types of distortions: small-sample effects, presence of outliers, and Markov dependence of class indices. Asymptotic expansions of risk are constructed and used for analysis of robustness characteristics and also for synthesis of new clustering algorithms under distortions.



1 Introduction 

The main problem of cluster analysis is the construction of a decision rule or clustering algorithm from an unclassified training sample $X = \{x_1, \ldots, x_n\} \subset R^N$ which splits the sample $X$ into $L$ clusters and can classify some “new” observations. Each of the known clustering algorithms uses not only the experimental data $X$ but also some prior model assumptions, e.g.: Gaussian probability distribution and independence of sample elements, homogeneity of clusters, absence of outliers and missing values. A system of these prior assumptions is called a hypothetical model of clustering.

In applications, the traditional hypothetical models of clustering are usu- 
ally not absolutely adequate to real phenomena and are subjected to some 
distortions (see Huber (1981), Hampel (1986), Rieder (1994), Bock (1974), 
Bock (1989)). Because of these distortions the traditional clustering algo- 
rithms (optimal w.r.t. fixed hypothetical models) often lose their optimality 
and become unstable (see McLachlan et al. (1988), Kharin (1997a, 1997b),
Kharin et al. (1993)). That is why the research area “Optimality and Ro-
bustness in Cluster Analysis” devoted to analysis of optimality and stability 
of traditional clustering algorithms and to synthesis of new robust decision 
rules is very topical. This paper is devoted to optimality and robustness 
problems of clustering for three types of distortions: A) small-sample ef- 
fects; B) presence of outliers in sample; C) Markov type dependence of class 
indices. 



2 Hypothetical Model and Its Distortions 

Let a regular $m$-parametric family of $N$-dimensional probability density functions (p.d.f.s) $Q = \{q(x; \theta'),\ x \in R^N : \theta' \in \Theta \subseteq R^m\}$ be defined in a feature space $R^N$, let the parameter $\theta' = (\theta'_k) \in \Theta$ be identifiable, and let $\Theta$ be a compact set. Let the random observations from $L \geq 2$ classes $\Omega_1, \ldots, \Omega_L$ be registered in $R^N$ with probabilities $\pi_1^0, \ldots, \pi_L^0$ ($\pi_i^0 > 0$, $\pi_1^0 + \ldots + \pi_L^0 = 1$). An observation from the class $\Omega_i$ is a random vector $X_i \in R^N$ with a p.d.f. $q(x; \theta_i^0)$, $i \in S = \{1, 2, \ldots, L\}$; the aggregate $Lm$-vector of true parameters $\theta^0 = (\theta_1^{0T} \vdots \ldots \vdots \theta_L^{0T})^T$ is unknown. The registered sample $X = \{x_1, \ldots, x_n\} \subset R^N$ consists of $n$ independent observations; the $t$-th observation $x_t \in R^N$ belongs to the class $\Omega_{d_t^0}$ ($t = 1, \ldots, n$). The purpose of any clustering procedure is to estimate the true classification vector $D^0 = (d_1^0, \ldots, d_n^0) \in S^n$ up to renaming of the classes: $\hat{D} = (\hat{d}_1, \ldots, \hat{d}_n) = \hat{D}(X) \in S^n$. We shall analyze here this hypothetical model of clustering under distortions of three types. First, we will evaluate the influence of small-sample effects on the optimality of the clustering procedure. Second, we will consider the situation where the sample $X$ contains random outliers. This means that in this sample an observation from the class $\Omega_i$ is really described by the Tukey-Huber mixture of p.d.f.s (see Kharin, Zhuk (1993)):

$$p_i^{\varepsilon}(x; \theta_i^0) = (1 - \varepsilon_i)\, q(x; \theta_i^0) + \varepsilon_i h_i(x), \qquad x \in R^N, \qquad (1)$$

where $h_i(x)$ is any p.d.f. of outliers distorting the class $\Omega_i$:

$$h_i(x) \geq 0, \qquad \int_{R^N} h_i(x)\,dx = 1, \qquad (2)$$

and $\varepsilon_i$ is a distortion level, the unknown probability of outlier presence ($0 \leq \varepsilon_i \leq \varepsilon_{i+} < 1$). Third, we will consider the situation where the class indices $\{d_t^0\}$ are stochastically dependent: $\{d_t^0\}$ is a homogeneous Markov chain with an initial probability distribution

$$\pi^0 = (\pi_i^0), \qquad \pi_i^0 = P\{d_1^0 = i\}, \qquad i \in S, \qquad (3)$$

and with a matrix of transition probabilities

$$P^0 = (p_{ij}^0), \qquad p_{ij}^0 = P\{d_{t+1}^0 = j \mid d_t^0 = i\}, \qquad i, j \in S. \qquad (4)$$

3 Optimality of Clustering and Small-Sample Effects

Let us consider the so-called (see McLachlan et al. (1988)) plug-in clustering decision rule (PDR):

$$d = d_1(x; \hat\theta) = \arg\max_{i \in S} \left( \pi_i^0 q(x; \hat\theta_i) \right), \qquad x \in R^N, \quad d \in S, \qquad (5)$$

where $x$ is any observation (from $X$ or a “new observation”) to be classified, and $\hat\theta = (\hat\theta_1^T \vdots \ldots \vdots \hat\theta_L^T)^T$ is the ML-estimator of $\theta^0$ by the unclassified sample $X$:

$$\hat\theta = \arg\max_{\theta \in \Theta^L} l(\theta), \qquad l(\theta) = \frac{1}{n} \sum_{t=1}^{n} \ln \sum_{i \in S} \pi_i^0 q(x_t; \theta_i). \qquad (6)$$

The PDR (5) is derived from the Bayesian decision rule $d = d_0(x)$ (minimizing the risk of classification) by substituting the estimator (6) for the unknown vector $\theta^0$.

Let us characterize performance of the PDR (5), (6) by the functional of risk (probability of error in classifying a random observation):

$$r(d_1) = 1 - E\left\{ \sum_{i \in S} \pi_i^0 \int_{d_1(x; \hat\theta) = i} q(x; \theta_i^0)\,dx \right\}, \qquad (7)$$

where $E\{\cdot\}$ means expectation w.r.t. the p.d.f. of the random sample $X$. For simplicity of notation in this section we will present the results for the case of $L = 2$ classes. Introduce the notations: $\mathbb{1}(z)$ is the unit step function; $G(x; \theta^0) = \pi_2^0 q(x; \theta_2^0) - \pi_1^0 q(x; \theta_1^0)$ is the Bayesian discriminant function defining the Bayesian decision rule (BDR): $d = d_0(x) = \mathbb{1}(G(x; \theta^0)) + 1$; $\Gamma_0 = \{x : G(x; \theta^0) = 0\} \subset R^N$ is the Bayesian discriminant surface; $H_i = E_{\theta_i^0}\{-\nabla^2_{\theta_i^0} \ln q(X_i; \theta_i^0)\}$ is the Fisher information matrix for the class $\Omega_i$; $J^0 = E_{\theta^0}\{-\nabla^2_{\theta^0} \ln p^0(X; \theta^0)\}$ is the Fisher information matrix for the mixture of p.d.f.s $p^0(x; \theta^0) = \sum_{i \in S} \pi_i^0 q(x; \theta_i^0)$;

$$Q(x) = \left(\nabla_{\theta^0} G(x; \theta^0)\right)^T (J^0)^{-1} \nabla_{\theta^0} G(x; \theta^0) \geq 0, \qquad x \in R^N. \qquad (8)$$

Theorem 1 If the parametric family of p.d.f.s $Q$ satisfies the standard regularity conditions (see, for example, Nguyen (1989)) and the p.d.f. $q(x; \theta)$ is differentiable w.r.t. $x \in R^N$ for any $\theta \in \Theta$, then the risk functional (7) for the PDR (5), (6) admits the asymptotic expansion:

$$r(d_1) = r_0 + \alpha/n + O(n^{-3/2}), \qquad (9)$$

where $r_0 = r(d_0)$ is the risk value for the BDR when the true parameters $\{\theta_i^0\}$ are a priori known, and $\alpha$ is the coefficient of asymptotic expansion:

$$\alpha = \frac{1}{2} \int_{\Gamma_0} Q(x)\, |\nabla_x G(x; \theta^0)|^{-1}\, ds_{N-1} \geq 0. \qquad (10)$$

Proof is conducted by the method of asymptotic expansion of risk (Kharin (1996)) and consists of 3 steps: 1) construction of stochastic expansion of the estimator (6); 2) construction of asymptotic expansion for the moments of $\hat\theta$; 3) construction of expansion for the risk functional. ■

Corollary 1 Under the conditions of Theorem 1 the PDR (5), (6) is asymptotically optimal: $r(d_1) \to r_0$ as $n \to \infty$.

For comparison, let us present the asymptotic expansion of risk $r^*$ for the situation where the sample $X$ is classified (the case of discriminant analysis) and $n_i = n\pi_i^0$ (see Kharin (1996)):

$$r^* = r_0 + \rho/n + O(n^{-3/2}), \qquad (11)$$

$$\rho = \rho_1 + \rho_2, \qquad \rho_i = \frac{1}{2} \int_{\Gamma_0} Q_i(x)\, |\nabla_x G(x; \theta^0)|^{-1}\, ds_{N-1} \geq 0,$$

$$Q_i(x) = \pi_i^0 \left(\nabla_{\theta_i^0} q(x; \theta_i^0)\right)^T H_i^{-1} \nabla_{\theta_i^0} q(x; \theta_i^0) \geq 0.$$

As is seen from a comparison of (9) and (11), the convergence orders of the risk to $r_0$ for both classified and unclassified samples $X$ are the same: $O(n^{-1})$. But the convergence rates are different and are determined by the coefficients $\rho$ and $\alpha$ respectively. Let us find a relation between $\rho$ and $\alpha$.
Introduce the notations: 7 == maxjt > 0, 



Eki 



L 



l^H^\0>lq{x;0°2) ^ ,n(r-0°W^n(r-P^)dr 



Theorem 2 If the conditions of Theorem 1 are satisfied and the asymptotics of “small overlapping of classes” takes place: $\gamma \to 0$, then

Of = p + A + 0(7^), 



1 






2 




EkiHpV,oq{x-0'^)\VMx-,e°)\-^dsM-i 



> 0 . 



Proof is conducted by asymptotic analysis of the matrix in (8). ■ 

For any $\delta > 0$ let us define the $\delta$-admissible sample size $n_\delta$ as the minimal size of the sample $X$ for which the relative risk increment satisfies the inequality $(r(d_1) - r_0)/r_0 \leq \delta$. Analogously, for the case of a classified training sample we define the $\delta$-admissible sample size $n_\delta^*$ by the inequality $(r^* - r_0)/r_0 \leq \delta$.



Corollary 2 If $Q$ is the family of $N$-dimensional normal densities: $q(x; \theta) = n_N(x \mid \theta, \Sigma)$, then






A2 



rix 



1 N 

6 \T T 



+ 1, ns ^ Tig + 



where $\Delta = \sqrt{(\theta_2^0 - \theta_1^0)^T \Sigma^{-1} (\theta_2^0 - \theta_1^0)}$ is the Mahalanobis distance.



/![g-AV8 



165 V 2 



+ 1) 



( 12 ) 



4 Robustness of Clustering Under Outliers 

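Let us now analyze the influence of outliers (1), (2) on the PDR (5), (6). In this case $X$ is a random sample from the distorted mixture of $L$ p.d.f.s (1):

$$p^{\varepsilon}(x; \theta^0) = \sum_{i \in S} \pi_i^0 p_i^{\varepsilon}(x; \theta_i^0) = p^0(x; \theta^0) + \sum_{i \in S} \pi_i^0 \varepsilon_i \left( h_i(x) - q(x; \theta_i^0) \right).$$

Introduce the notations: $\varepsilon_+ = \max_{i \in S} \varepsilon_{i+}$ is the maximal distortion level; $1_p$ is the $p$-vector all elements of which are equal to 1;

$$H_0(\theta^0; \theta) = -\int_{R^N} p^0(x; \theta^0) \ln p^0(x; \theta)\,dx$$

is the Shannon functional;

$$A_i(\theta) = \int_{R^N} \left( h_i(x) - q(x; \theta_i^0) \right) \ln p^0(x; \theta)\,dx, \qquad A_i^{(1)}(\theta) = \nabla_\theta A_i(\theta), \qquad i \in S.$$

Theorem 3 If $Q$ satisfies the regularity conditions, the functions $H_0(\theta^0; \theta)$, $\{A_i(\theta)\}$ are three times differentiable w.r.t. $\theta \in \Theta^L$ and the extreme point $\theta^{\varepsilon} = \arg\min_{\theta \in \Theta^L} \left( H_0(\theta^0; \theta) - \sum_{i \in S} \varepsilon_i \pi_i^0 A_i(\theta) \right)$ is unique, then the almost sure convergence of the MLE (6) under outliers (1), (2) takes place as $n \to \infty$:

$$\hat\theta \xrightarrow{a.s.} \theta^{\varepsilon} = \theta^0 + \sum_{i \in S} \varepsilon_i \pi_i^0 (J^0)^{-1} A_i^{(1)}(\theta^0) + O(\varepsilon_+^2)\, 1_{Lm}.$$

Proof is made in the same way as the proof of strong consistency of ML-estimators. ■

It is seen from Theorem 3 that under outliers ($\varepsilon_+ > 0$) the ML-estimator (6) loses its consistency, and, as a result, the PDR (5) converges as $n \to \infty$ to a decision rule different from the BDR:

$$d = d(x; \theta^{\varepsilon}) = \arg\max_{i \in S} \left( \pi_i^0 q(x; \theta_i^{\varepsilon}) \right), \qquad x \in R^N. \qquad (13)$$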



That is why we will characterize asymptotic (as $n \to \infty$) robustness of the PDR (5) under outliers by the risk functional (probability of classification error) for the limit decision rule (13):

$$r^{\varepsilon} = 1 - \sum_{i \in S} \pi_i^0 \int_{d(x; \theta^{\varepsilon}) = i} q(x; \theta_i^0)\,dx, \qquad (14)$$

by its supremum $r_+ = \sup r^{\varepsilon}$, and by the robustness factor $\kappa = (r_+ - r_0)/r_0 \geq 0$.

Let us present here our results for the case of $L = 2$ classes. Denote: $(J^0)^{-1}_{(i)}$ is the $(Lm \times m)$-matrix formed by the $i$-th block-column of the matrix $(J^0)^{-1}$; $u_{ij}(x) = \pi_i^0 (J^0)^{-1}_{(i)} \nabla_{\theta_i^0} q(x; \theta_i^0) - \pi_j^0 (J^0)^{-1}_{(j)} \nabla_{\theta_j^0} q(x; \theta_j^0)$ is an $Lm$-vector-column; $h_+(x) = \sum_{i \in S} \pi_i^0 h_i(x)$ is the mixture of the contaminating p.d.f.s.

Theorem 4 If the conditions of Theorem 3 hold, then the risk functional defined by (14), (13) under outliers (1), (2) satisfies the asymptotic expansion:

$$r^{\varepsilon} = r_0 + \varepsilon_1^2 \rho_{11} + \varepsilon_2^2 \rho_{22} + 2\varepsilon_1 \varepsilon_2 \rho_{12} + O(\varepsilon_+^3),$$

$$\rho_{ij} = \frac{\pi_i^0 \pi_j^0}{2} A_i^{(1)T}(\theta^0)\, G_2\, A_j^{(1)}(\theta^0), \qquad G_2 = \int_{\Gamma_0} u_{12}(x)\, u_{12}^T(x)\, |\nabla_x G(x; \theta^0)|^{-1}\, ds_{N-1}.$$

Proof is conducted by the general method of asymptotic expansion of risk presented in Kharin (1996). ■



Corollary 3 In the case of two equidistorted classes {£{ = €< 



K, 



r+ = ro + pe\ + 0(£+), p = {x)GI ^\ 00 lnp“(x; 6^)dx , 

= K+ 0 (e+), K < £+ f N h'^{x)V0o lnp°(x; 0^)02^00 lnp(x; 9 ^)d 

Jrv. 



X. 



Corollary 4 If $Q$ is the family of $N$-dimensional normal densities: $q(x; \theta) = n_N(x \mid \theta, \Sigma)$, the classes are equiprobable ($\pi_1^0 = \pi_2^0$), equidistorted ($\varepsilon_{1+} = \varepsilon_{2+} = \varepsilon_+$), and $h_1(x) = h_2(x) = h_+(x) = q(x; \theta^+)$, where $\theta^+ \in R^N$ is the mathematical mean of outliers, then



K < 



4<^(A/2) 

A#(-A/2) 



A^ 1 2 

^ ^ i=l 



9+ - S-1(0+ - 9°)+ 



where $\varphi(z)$, $\Phi(z)$ are the p.d.f. and the distribution function for the standard Gaussian distribution $N_1(0, 1)$.



As is seen from Theorems 3, 4 and Corollaries 3, 4, the presence of outliers can very strongly influence the deviation $\theta^{\varepsilon} - \theta^0$ and the robustness factor $\kappa$. To reduce this influence we propose (Kharin (1997b)) a new robust clustering algorithm with smoothing which consists of the following steps.

1. Construct any nonsingular robust estimate $\hat{S} = (\hat{s}_{kl})$ ($k, l = 1, \ldots, N$) of the common covariance matrix $\Sigma$ by the sample $X$ (e.g. Huber (1981) estimate).

2. Construct the matrix $B = (b_{ij})$ of Mahalanobis distances between all elements of the sample $X$: $b_{ij} = \sqrt{(x_i - x_j)^T \hat{S}^{-1} (x_i - x_j)} \geq 0$ ($i, j = 1, \ldots, n$).

3. For fixed integer $M$ (parameter of the algorithm) and $i = 1, \ldots, n$: find the $M$ nearest neighbours of the point $x_i$: $x_{j_1(i)}, \ldots, x_{j_M(i)} \in X$; smooth the observation $x_i$, that is, replace it by the local mean point:

$$\tilde{x}_i = \left( x_i + \sum_{k=1}^{M} x_{j_k(i)} \right) / (M + 1).$$

This step results in the “smoothed sample” $\tilde{X} = \{\tilde{x}_1, \ldots, \tilde{x}_n\}$.

4. If necessary, repeat steps 2, 3 for reiterated smoothing of $\tilde{X}$.

5. Apply the well known clustering algorithm “L-means” to the “smoothed sample” $\tilde{X}$ and get the estimate of the classification vector $D^0 = (d_1^0, \ldots, d_n^0)$.
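
Steps 2 to 4 of the smoothing procedure can be sketched as follows. The function name `smooth_sample` is hypothetical, and the ordinary sample covariance is used only as a stand-in when no matrix is supplied: it is not robust, so a robust estimate as required by step 1 would be passed through the `cov` argument in practice. Step 5 (the final clustering) is left to any standard routine.

```python
import numpy as np

def smooth_sample(X, M, n_rounds=1, cov=None):
    """Replace each point by the mean of itself and its M nearest
    neighbours in Mahalanobis distance (steps 2-4 of the algorithm)."""
    X = np.asarray(X, dtype=float)
    for _ in range(n_rounds):
        # Stand-in for the robust estimate of step 1 (NOT robust itself)
        S = np.atleast_2d(np.cov(X, rowvar=False) if cov is None else cov)
        S_inv = np.linalg.inv(S)
        # Step 2: matrix B of pairwise Mahalanobis distances
        diff = X[:, None, :] - X[None, :, :]
        quad = np.einsum("ijk,kl,ijl->ij", diff, S_inv, diff)
        B = np.sqrt(np.maximum(quad, 0.0))
        np.fill_diagonal(B, np.inf)  # a point is not its own neighbour
        # Step 3: local mean over the point and its M nearest neighbours
        nbrs = np.argsort(B, axis=1)[:, :M]
        X = (X + X[nbrs].sum(axis=1)) / (M + 1)
    return X

print(smooth_sample([[0.0], [1.0], [3.0], [10.0]], M=1).ravel())
```

In the one-dimensional usage example the isolated point 10 is pulled towards its nearest neighbour 3, giving 6.5, which is the intended damping effect on outlying observations.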

Some computer experiments with this robust clustering procedure are presented in Kharin (1997b).

5 Optimal Clustering Under Markov Dependence of Class Indices

Let us consider the third type of distortions described in Section 2, when the class indices $\{d_t^0\}$ are stochastically dependent and form a Markov chain with any fixed probability characteristics (3), (4).

First, let the probability characteristics of the model $\pi^0$, $P^0$, $\theta^0$ be a priori known. For a clustering decision rule $\hat{D} = \hat{D}(X) = (\hat{d}_1, \ldots, \hat{d}_n) \in S^n$ define a bounded loss function $w = w(D', D'')$, $D', D'' \in S^n$, where $w \geq 0$ is the loss value for the case $D^0 = D'$, $\hat{D} = D''$. We will characterize performance of a clustering algorithm $\hat{D} = \hat{D}(X)$ by the risk functional (expected losses):

$$r = r(\hat{D}(\cdot)) = E\{w(D^0, \hat{D}(X))\} \geq 0. \qquad (15)$$

Theorem 5 If the true values $\pi^0$, $P^0$, $\theta^0$ are a priori known, then the optimal decision rule minimizing the risk functional (15) is

$$\hat{D} = D_0(X) = \arg\min_{D \in S^n} \sum_{J \in S^n} w(J, D) \left( \prod_{t=1}^{n} q(x_t; \theta^0_{J_t}) \right) \pi^0_{J_1} \prod_{t=1}^{n-1} p^0_{J_t J_{t+1}}.$$

Proof is conducted by using the expressions (3), (4), (15). ■

Corollary 5 The optimal decision rule minimizing the special case of risk, the probability of classification error $r = P\{\hat{D} \neq D^0\}$, has the form:

$$\hat{D} = D_0(X) = \arg\max_{D \in S^n} \Lambda(X; D, \theta^0, \pi^0, P^0), \qquad (16)$$

$$\Lambda(X; D, \theta^0, \pi^0, P^0) = \Lambda_1(X; D, \theta^0) + \Lambda_2(D, \pi^0, P^0),$$

$$\Lambda_1(X; D, \theta^0) = \sum_{t=1}^{n} \ln q(x_t; \theta^0_{d_t}), \qquad \Lambda_2(D, \pi^0, P^0) = \ln \pi^0_{d_1} + \sum_{t=1}^{n-1} \ln p^0_{d_t d_{t+1}}.$$

The objective function in (16) admits the equivalent representation:

$$\Lambda(X; D, \theta^0, \pi^0, P^0) = \sum_{t=1}^{n-1} f_t(d_t, d_{t+1}),$$

$$f_t(d_t, d_{t+1}) = \delta_{t1} \left( \ln \pi^0_{d_1} + \ln q(x_1; \theta^0_{d_1}) \right) + \ln p^0_{d_t d_{t+1}} + \ln q(x_{t+1}; \theta^0_{d_{t+1}}),$$

where $\delta_{t1}$ is the Kronecker symbol. This representation allows the efficient method of dynamic programming to be used to solve the maximization problem (16).
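
A dynamic programming solution of (16) can be sketched as a standard Viterbi-type recursion over the additive terms. The function name and the input layout are assumptions of this illustration: `log_q[t, i]` is taken to hold $\ln q(x_t; \theta_i^0)$, `log_pi` and `log_P` hold the logarithms of (3) and (4), and classes are numbered from 0.

```python
import numpy as np

def viterbi_classify(log_q, log_pi, log_P):
    """Maximize Lambda over all class sequences by dynamic programming.
    Returns the maximizing sequence and the maximal value of Lambda."""
    n, L = log_q.shape
    score = log_pi + log_q[0]           # best value of paths ending in each class
    back = np.zeros((n, L), dtype=int)  # best predecessor for each (t, class)
    for t in range(1, n):
        cand = score[:, None] + log_P   # cand[i, j]: come from i, move to j
        back[t] = np.argmax(cand, axis=0)
        score = cand[back[t], np.arange(L)] + log_q[t]
    d = np.empty(n, dtype=int)
    d[-1] = int(np.argmax(score))
    for t in range(n - 1, 0, -1):       # trace the optimal path backwards
        d[t - 1] = back[t, d[t]]
    return d, float(np.max(score))
```

The cost is of order $nL^2$ operations, compared with $L^n$ evaluations for an exhaustive search over $S^n$.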

Consider briefly the case when the vector of parameters $\theta^0$ is unknown.

Theorem 6 If the parameter value $\theta^0$ is unknown, then the decision rule that maximizes the upper bound of probability of error-free classification is

$$\hat{D} = D_1(X) = \arg\max_{D \in S^n} \Lambda(X; D, \pi^0, P^0),$$

$$\Lambda(X; D, \pi^0, P^0) = \Lambda_2(D, \pi^0, P^0) + \max_{\theta \in \Theta^L} \Lambda_1(X; D, \theta).$$

Proof consists in maximizing the indicated upper bound. ■

Some computer algorithms realizing the decision rule $\hat{D} = D_1(X)$ are proposed and evaluated in Kharin (1996).

Abstract: A finite sequence of independent binomial variables is considered for
which the null hypothesis of homogeneity is to be tested against the alternative 
hypotheses of one or two change-points, respectively. The tests are based on the 
likelihood ratio statistic or on an approximate linearization of the log likelihood 
ratio statistic. This yields scan type or cusum statistics in a discrete situation. At 
the same time maximum likelihood or approximate maximum likelihood estimates 
of the locations of the change-points are derived. For the problem with two 
change-points - corresponding to the detection of a cluster in time - we distinguish 
between the cases with a fixed and a variable distance of the two points. Under 
the null hypothesis exact upper bounds for the upper tails are derived. In the 
special case of Bernoulli variables these bounds are considerably simplified. 



1 Introduction 

Hinkley and Hinkley (1970) considered $c$ independent Bernoulli variables $X_1, \ldots, X_c$, where

$$P(X_i = 1) = \pi_0, \quad P(X_i = 0) = 1 - \pi_0 \quad \text{for } i = 1, \ldots, \tau,$$
$$P(X_i = 1) = \pi_1, \quad P(X_i = 0) = 1 - \pi_1 \quad \text{for } i = \tau + 1, \ldots, c. \qquad (1)$$

They assumed that $\pi_0$ and $\pi_1$ are known but that the change-point $\tau$ is unknown. The authors derived the maximum likelihood estimate of $\tau$ and likelihood ratio statistics for testing hypotheses about $\tau$. Furthermore, the asymptotic distributions of $\hat\tau$ and the likelihood ratio statistics were discussed. These asymptotic distributions are unaffected if $\pi_0$ and $\pi_1$ are also unknown.

Pettitt (1979) generalized a distribution-free cusum technique proposed by 
McGilchrist and Woodyer (1975) by studying a statistic which is equivalent 
to the Kolmogorov-Smirnov two-sample statistic. This approach was ex- 
tended in Pettitt (1980) where a conditional exact test based on a cusum 
type statistic was compared to a likelihood ratio statistic and where different
estimates of $\tau$ were discussed. In Pettitt (1979) the extension of the
Bernoulli to the binomial case was also considered. 

A likelihood ratio and a cusum statistic for binomial variables were compared in Worsley (1983). Here, iterative procedures for calculating the exact null and alternative distributions for both test statistics were derived as well as approximations for these distributions. Corresponding limit distributions were given in Horváth (1989).

Fu and Curnow (1990) proved that a discrete time version of the scan statis- 
tic is equivalent to the likelihood ratio, if Bernoulli variables and a known 
distance (5) between two change-points are assumed. Recurrence equations 
were derived both for the exact null and alternative distributions under the
assumption that $\pi_0$, $\pi_1$, and this distance are known. Further, the distribution of the
maximum likelihood estimate of the location of the changed segment was 
given. The changed segment corresponds to a cluster in time. Bounds and 
approximations for the null distribution of the discrete scan statistic were 
derived by Glaz and Naus (1991) who considered not only Bernoulli distribu- 
tions but also binomial as well as general discrete distributions. Wallenstein 
et al. (1994) derived approximations for the null and alternative distribu- 
tions in the case of Bernoulli variables. In particular, they studied the power 
of the discrete scan test, for the case that the distance (5) of the two change- 
points does not correspond to the window width used in the discrete scan 
statistic. 

Levin and Kline (1985) derived an exact conditional cusum test for the case, 
where binomial variables are assumed with two change-points of which the 
distance is unknown and with an unknown baseline parameter $\pi_0$. Alternative
asymptotic tests for the problems considered in Worsley (1983) and
Levin and Kline (1985) were proposed in Lombard (1987). 

Here, we derive likelihood ratio statistics and approximate log likelihood 
ratio statistics for the detection of one or two change-points in binomial 
variables. The approximations turn out to be cusum type statistics. Under 
the null hypothesis of no change exact upper bounds for the upper tails 
are derived. However, these bounds seem to be of practical use only in the 
special case of Bernoulli variables. 



2 Problems with One Change-point 

Let $X_1, \ldots, X_c$ be independent binomial variables with

$$P(X_i = j) = \binom{n_i}{j} \pi_0^j (1 - \pi_0)^{n_i - j} \quad \text{for } i = 1, \ldots, \tau, \quad j = 0, 1, \ldots, n_i,$$
$$P(X_i = j) = \binom{n_i}{j} \pi_1^j (1 - \pi_1)^{n_i - j} \quad \text{for } i = \tau + 1, \ldots, c, \quad j = 0, 1, \ldots, n_i, \qquad (2)$$

where $\tau \in \{1, \ldots, c-1\}$, $0 < \pi_0 \leq \pi_1 < 1$, $n_i \in \{1, 2, \ldots\}$ for $i = 1, \ldots, c$.

We consider the test problem $H_0: \pi_1 = \pi_0$ vs. $H_1: \pi_1 > \pi_0$. With the notation

$$M_j = \sum_{i=1}^{j} X_i, \qquad N_j = \sum_{i=1}^{j} n_i \qquad \text{for } j = 1, \ldots, c \qquad (3)$$

the likelihood is given by

$$L(\pi_0, \pi_1, \tau) = \left( \prod_{i=1}^{c} \binom{n_i}{X_i} \right) \pi_0^{M_\tau} (1 - \pi_0)^{N_\tau - M_\tau} \times \pi_1^{M_c - M_\tau} (1 - \pi_1)^{N_c - N_\tau - M_c + M_\tau}. \qquad (4)$$

Under $H_0$ we have the maximum likelihood estimate $\tilde\pi_0 = M_c / N_c$ of $\pi_0$, while under $H_1$ the maximum likelihood estimates

$$\hat\pi_0 = \frac{M_\tau}{N_\tau}, \qquad \hat\pi_1 = \frac{M_c - M_\tau}{N_c - N_\tau} \qquad (5)$$

result. The maximum likelihood estimate of $\tau$ is that value of $\tau \in \{1, \ldots, c-1\}$ which maximizes $L(\hat\pi_0, \hat\pi_1, \tau)$.

Without doubt, the likelihood ratio

$$LR_1 = \max_{1 \leq \tau \leq c-1} \left\{ \hat\pi_0^{M_\tau} (1 - \hat\pi_0)^{N_\tau - M_\tau}\, \hat\pi_1^{M_c - M_\tau} (1 - \hat\pi_1)^{N_c - N_\tau - M_c + M_\tau} \times \tilde\pi_0^{-M_c} (1 - \tilde\pi_0)^{M_c - N_c} \right\} \qquad (6)$$

is a complicated function of the statistics $M_1, \ldots, M_{c-1}$ and $M_c$. Therefore, Worsley (1983) and others consider the distribution of $LR_1$ under $H_0$ conditional on a fixed value of $M_c$. If $H_0$ is true, then $M_c$ is sufficient for the parameter $\pi_0$ and the conditional distribution of $LR_1$ does not depend on $\pi_0$. This is the reason why we adopt this approach and consider the conditional distribution of $LR_1$. However, as the results of Worsley (1983) show, the exact conditional distribution of $LR_1$ is still difficult to compute. Therefore, we propose a linearization of the log likelihood ratio statistic. To do this we consider the reparametrization $\pi_1 = \pi_0 + \varepsilon$ with $0 \leq \varepsilon < 1 - \pi_0$.

The null hypothesis corresponds to e = 0. We use a Taylor expansion at 
this point, i. e. we replace L(7 To, tTq + c, r) / L(7 To, tTo, •) by the first derivative 
with respect to e at the point e = 0: 



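$$\frac{\partial}{\partial \epsilon} \ln \frac{L(\pi_0, \pi_0 + \epsilon, \tau)}{L(\pi_0, \pi_0, \cdot)} = \frac{1}{\pi_0 (1-\pi_0)} \big( M_c - M_\tau - \pi_0 (N_c - N_\tau) \big) + O(\epsilon). \tag{7}$$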



Close to $H_0$ we can, thus, approximate the log likelihood ratio by the statistic

$$\max_{1 \le \tau \le c-1} \{ M_c - M_\tau - \pi_0 (N_c - N_\tau) \} \tag{8}$$

and for a fixed value $M_c = m_c$ by

$$T_1 = \max_{1 \le \tau \le c-1} \Big\{ m_c - M_\tau - \frac{m_c}{N_c} (N_c - N_\tau) \Big\}. \tag{9}$$

Here, the approximate maximum likelihood estimate $\hat\tau$ of $\tau$ is given by that
value of $\tau$ for which the maximum is achieved. The statistic $T_1$ is equivalent
to the cumulative sum statistic as understood by Pettitt (1980), Worsley
(1983), and Horváth (1989). Levin and Kline (1985) point out that their
cumulative sum statistic is defined in a different way.
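
As an illustration, the statistic $T_1$ in (9) and the estimate $\hat\tau$ can be computed directly from the partial sums in (3). The following is a minimal sketch only; the function name `cusum_t1` and the list-based input format are assumptions made for this example and do not appear in the paper.

```python
from itertools import accumulate

def cusum_t1(x, n):
    """Return (T1, tau_hat) for observed counts x[i] out of n[i] trials.

    Hypothetical helper: evaluates m_c - M_tau - (m_c / N_c) * (N_c - N_tau)
    for tau = 1, ..., c-1 and returns the maximum and its first maximizer.
    """
    c = len(x)
    M = list(accumulate(x))  # M[j-1] holds the partial sum M_j
    N = list(accumulate(n))  # N[j-1] holds the partial sum N_j
    m_c, N_c = M[-1], N[-1]
    best, tau_hat = None, None
    for tau in range(1, c):
        value = m_c - M[tau - 1] - m_c / N_c * (N_c - N[tau - 1])
        if best is None or value > best:
            best, tau_hat = value, tau
    return best, tau_hat

# Bernoulli example: three zeros followed by three ones
t1, tau_hat = cusum_t1([0, 0, 0, 1, 1, 1], [1] * 6)
```

In this toy sequence the maximum is attained at $\tau = 3$, i.e. the change is placed between the third and the fourth observation.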

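If we set

$$A_{\tau,u} = \Big\{ m_c - M_\tau - \frac{m_c}{N_c} (N_c - N_\tau) \ge u \Big\} = \Big\{ m_c - M_\tau \ge u + \frac{m_c}{N_c} (N_c - N_\tau) \Big\}, \tag{10}$$

the first upper Bonferroni bound yields

$$P_{H_0}(T_1 \ge u \mid M_c = m_c) = P_{H_0}\Big( \bigcup_{\tau=1}^{c-1} A_{\tau,u} \,\Big|\, M_c = m_c \Big) \le \sum_{\tau=1}^{c-1} P_{H_0}(A_{\tau,u} \mid M_c = m_c). \tag{11}$$

With $y_\tau = \lceil u + \frac{m_c}{N_c} (N_c - N_\tau) \rceil$ we get

$$P_{H_0}(A_{\tau,u} \mid M_c = m_c) = \sum_{i=y_\tau}^{\min\{m_c,\, N_c - N_\tau\}} P_{H_0}(m_c - M_\tau = i \mid M_c = m_c) \tag{12}$$

with

$$P_{H_0}(m_c - M_\tau = i \mid M_c = m_c) = \binom{N_c}{m_c}^{-1} \sum_{\substack{x_1 + \cdots + x_\tau = m_c - i \\ x_{\tau+1} + \cdots + x_c = i}} \ \prod_{j=1}^{c} \binom{n_j}{x_j}, \tag{13}$$

summing over all values $x_j \in \{0, 1, \ldots, n_j\}$ for $j = 1, \ldots, c$ for which the
two conditions hold.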

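Obviously, these bounds are not easy to compute if $m_c$ is large. The problem
is considerably simplified in the Bernoulli situation with $n_1 = \cdots = n_c = 1$,
$X_i \in \{0, 1\}$, $N_i = i$ for $i = 1, \ldots, c$:

$$y_\tau = \Big\lceil u + \frac{m_c}{c} (c - \tau) \Big\rceil, \tag{14}$$

$$P_{H_0}(A_{\tau,u} \mid M_c = m_c) = \sum_{i=y_\tau}^{\min\{m_c,\, c-\tau\}} P_{H_0}(m_c - M_\tau = i \mid M_c = m_c), \tag{15}$$

$$P_{H_0}(m_c - M_\tau = i \mid M_c = m_c) = \binom{c}{m_c}^{-1} \binom{\tau}{m_c - i} \binom{c - \tau}{i}. \tag{16}$$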
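
In the Bernoulli situation the Bonferroni bound reduces to a double sum of hypergeometric probabilities, which is easy to evaluate numerically. The sketch below is one possible implementation under stated assumptions: the function name `bonferroni_bound_t1` is hypothetical, exact rational arithmetic is used so that the ceiling defining $y_\tau$ is not disturbed by rounding, and the result is capped at 1 because a bound larger than 1 carries no information.

```python
from fractions import Fraction
from math import ceil, comb

def bonferroni_bound_t1(u, c, m_c):
    """Upper bound for P(T1 >= u | M_c = m_c) in the Bernoulli case.

    Hypothetical helper: sums, over tau = 1, ..., c-1, the hypergeometric
    tail probabilities P(m_c - M_tau >= y_tau | M_c = m_c).
    """
    denom = comb(c, m_c)
    total = Fraction(0)
    for tau in range(1, c):
        # y_tau = ceil(u + (m_c / c) * (c - tau)), computed exactly
        y = max(0, ceil(Fraction(u) + Fraction(m_c * (c - tau), c)))
        for i in range(y, min(m_c, c - tau) + 1):
            total += Fraction(comb(tau, m_c - i) * comb(c - tau, i), denom)
    return float(min(total, 1))

# c = 6 trials with m_c = 3 successes, observed T1 = 1.5
bound = bonferroni_bound_t1(1.5, 6, 3)
```

For this small example only the arrangement with all three successes at the end of the sequence reaches the value 1.5, so the bound coincides with the exact conditional probability 1/20.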







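3 Problems with Two Change-points

Let $X_1, \ldots, X_c$ be independent binomial variables with

$$P(X_i = j) = \binom{n_i}{j} \pi_0^{j} (1-\pi_0)^{n_i-j} \quad \text{for } i = 1, \ldots, \tau,\ \tau+\delta+1, \ldots, c,\ j = 0, 1, \ldots, n_i,$$

$$P(X_i = j) = \binom{n_i}{j} \pi_1^{j} (1-\pi_1)^{n_i-j} \quad \text{for } i = \tau+1, \ldots, \tau+\delta,\ j = 0, 1, \ldots, n_i, \tag{17}$$

where $\delta \in \{1, \ldots, c-2\}$, $\tau \in \{1, \ldots, c-\delta-1\}$, $0 < \pi_0 < \pi_1 < 1$,
$n_i \in \{1, 2, \ldots\}$ for $i = 1, \ldots, c$.

We consider the test problem $H_0: \pi_1 = \pi_0$ vs. $H_1: \pi_1 > \pi_0$. With the
notation

$$M_j = \sum_{i=1}^{j} X_i, \quad N_j = \sum_{i=1}^{j} n_i \quad \text{for } j = 1, \ldots, c \tag{18}$$

the likelihood is given by

$$L(\pi_0, \pi_1, \tau, \delta) = \Big(\prod_{i=1}^{c} \binom{n_i}{X_i}\Big)\, \pi_0^{M_c - M_{\tau+\delta} + M_\tau} (1-\pi_0)^{N_c - N_{\tau+\delta} + N_\tau - M_c + M_{\tau+\delta} - M_\tau} \times \pi_1^{M_{\tau+\delta} - M_\tau} (1-\pi_1)^{N_{\tau+\delta} - N_\tau - M_{\tau+\delta} + M_\tau}. \tag{19}$$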

3.1 Situation with $\pi_0$, $\pi_1$, and $\delta$ Known

If $\pi_0$, $\pi_1$, and $\delta$ are known the likelihood ratio is simplified to

$$LR_2 = \max_{1 \le \tau \le c-\delta-1} \Big( \frac{\pi_1 (1-\pi_0)}{\pi_0 (1-\pi_1)} \Big)^{M_{\tau+\delta} - M_\tau} \Big( \frac{1-\pi_1}{1-\pi_0} \Big)^{N_{\tau+\delta} - N_\tau}. \tag{20}$$

Under $H_1$, $\pi_1 (1-\pi_0)\, \pi_0^{-1} (1-\pi_1)^{-1} > 1$. Thus, for fixed $\tau$ the function
to be maximized is a monotonic increasing function of the statistic
$(M_{\tau+\delta} - M_\tau)$, if $n_1 = \cdots = n_c$. Therefore, in this case, we can use the
equivalent statistic

$$T_2 = \max_{1 \le \tau \le c-\delta-1} \{ M_{\tau+\delta} - M_\tau \} \tag{21}$$

instead of $LR_2$. This statistic does not depend on the values of $\pi_0$ and $\pi_1$.
Formally, it is similar to the linear ratchet scan statistic (Krauth, 1992),
which was studied in another context. However, in $T_2$ the two extreme
windows corresponding to $\tau = 0$ and $\tau = c - \delta$ are not considered. The
approximate maximum likelihood estimate $\hat\tau$ of $\tau$ is given by that value
of $\tau$ for which the maximum is achieved.
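
The statistic $T_2$ is a moving-window count and can be evaluated with one pass over the partial sums. The following sketch is illustrative only; the name `scan_t2` and the input format are assumptions, and the two extreme windows ($\tau = 0$ and $\tau = c - \delta$) are deliberately excluded, as in the definition above.

```python
from itertools import accumulate

def scan_t2(x, delta):
    """Return (T2, tau_hat): the largest window sum M_{tau+delta} - M_tau
    over tau = 1, ..., c - delta - 1, and the first tau attaining it.

    Hypothetical helper written for illustration.
    """
    c = len(x)
    if not 1 <= delta <= c - 2:
        raise ValueError("delta must lie in {1, ..., c-2}")
    M = [0] + list(accumulate(x))  # M[j] holds the partial sum M_j, M[0] = 0
    best, tau_hat = None, None
    for tau in range(1, c - delta):
        value = M[tau + delta] - M[tau]
        if best is None or value > best:
            best, tau_hat = value, tau
    return best, tau_hat

t2, tau_hat = scan_t2([0, 1, 0, 1, 1, 1, 0, 0], delta=3)
```

Here the window covering observations 4 to 6 contains three successes, so the sketch returns the value 3 with $\hat\tau = 3$.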

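If we set $B_{\tau,u} = \{ M_{\tau+\delta} - M_\tau \ge u \}$ the first upper Bonferroni bound yields
in analogy to (10)

$$P_{H_0}(T_2 \ge u \mid M_c = m_c) \le \sum_{\tau=1}^{c-\delta-1} P_{H_0}(B_{\tau,u} \mid M_c = m_c). \tag{22}$$

With $y = \lceil u \rceil$ we get in analogy to (11) and (12)

$$P_{H_0}(B_{\tau,u} \mid M_c = m_c) = \sum_{i=y}^{\min\{m_c,\, N_{\tau+\delta} - N_\tau\}} \binom{N_c}{m_c}^{-1} \sum_{\substack{x_{\tau+1} + \cdots + x_{\tau+\delta} = i \\ x_1 + \cdots + x_\tau + x_{\tau+\delta+1} + \cdots + x_c = m_c - i}} \ \prod_{j=1}^{c} \binom{n_j}{x_j}. \tag{23}$$

For large values of $m_c$ these bounds are not easy to compute. However, for
the Bernoulli situation with $n_1 = \cdots = n_c = 1$, $X_i \in \{0, 1\}$, $N_i = i$ for
$i = 1, \ldots, c$, we get in analogy to (11) and (12)

$$P_{H_0}(B_{\tau,u} \mid M_c = m_c) = \sum_{i=y}^{\min\{m_c,\, \delta\}} \binom{c}{m_c}^{-1} \binom{\delta}{i} \binom{c-\delta}{m_c - i}. \tag{24}$$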
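
Because the Bernoulli probability in (24) does not depend on $\tau$, the bound (22) is simply $(c - \delta - 1)$ times one hypergeometric tail. A minimal sketch, assuming a hypothetical function name `bonferroni_bound_t2` and capping the result at 1:

```python
from fractions import Fraction
from math import ceil, comb

def bonferroni_bound_t2(u, c, m_c, delta):
    """Upper bound for P(T2 >= u | M_c = m_c) in the Bernoulli case.

    Hypothetical helper: (c - delta - 1) identical hypergeometric tails,
    one for each admissible window position tau.
    """
    y = max(0, ceil(u))
    tail = sum(
        Fraction(comb(delta, i) * comb(c - delta, m_c - i), comb(c, m_c))
        for i in range(y, min(m_c, delta) + 1)
    )
    return float(min((c - delta - 1) * tail, 1))

# c = 8 trials, m_c = 4 successes, window width 3, observed T2 = 3
bound = bonferroni_bound_t2(3, 8, 4, 3)
```

With these illustrative numbers a single window has tail probability 5/70, and the four admissible window positions give the bound 2/7.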



In the Bernoulli situation, $T_2$ is a discrete time version of the scan statistic
as was observed by Fu and Curnow (1990) and Wallenstein et al. (1994).
However, while we defined the statistic $T_2$ as the maximum over
$\tau \in \{1, \ldots, c-\delta-1\}$, Fu and Curnow (1990) considered
$\tau \in \{0, 1, \ldots, c-\delta-1\}$, and Glaz and Naus (1991) as well as Wallenstein
et al. (1994) $\tau \in \{0, 1, \ldots, c-\delta\}$. Our definition is consistent with the
alternative hypothesis where the variables $X_1$ and $X_c$ always correspond to the
parameter $\pi_0$, while Glaz and Naus (1991) and Wallenstein et al. (1994) also
consider the two extreme windows corresponding to $\tau = 0$ and $\tau = c - \delta$ in
analogy to the continuous time scan statistic.



3.2 Situation with $\pi_0$, $\pi_1$ Known and $\delta$ Unknown

As already noted for the continuous time scan statistic (Krauth, 1998) the
assumption of a fixed window width is very restrictive. If we assume in
Section 3.1 that $\delta$ is unknown we get for the likelihood ratio

$$LR_3 = \max_{\delta \in \{1, \ldots, c-2\}} \ \max_{1 \le \tau \le c-\delta-1} \Big( \frac{\pi_1 (1-\pi_0)}{\pi_0 (1-\pi_1)} \Big)^{M_{\tau+\delta} - M_\tau} \Big( \frac{1-\pi_1}{1-\pi_0} \Big)^{N_{\tau+\delta} - N_\tau} \tag{25}$$

instead of $LR_2$. Just as in Section 2 we consider the reparametrization
$\pi_1 = \pi_0 + \epsilon$ with $0 < \epsilon < 1 - \pi_0$ and linearize the log likelihood ratio by
replacing the expression to be maximized by its first derivative with respect
to $\epsilon$ at the point $\epsilon = 0$:

$$\frac{\partial}{\partial \epsilon} \ln \frac{L(\pi_0, \pi_0 + \epsilon, \tau, \delta)}{L(\pi_0, \pi_0, \cdot, \cdot)} = \frac{1}{\pi_0 (1-\pi_0)} \big( M_{\tau+\delta} - M_\tau - \pi_0 (N_{\tau+\delta} - N_\tau) \big) + O(\epsilon). \tag{26}$$

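Thus, we can approximate the log likelihood ratio close to $H_0$ by

$$\max_{\substack{1 \le \delta \le c-2 \\ 1 \le \tau \le c-\delta-1}} \{ M_{\tau+\delta} - M_\tau - \pi_0 (N_{\tau+\delta} - N_\tau) \} \tag{27}$$

and for a fixed value of $M_c = m_c$ by

$$T_3 = \max_{\substack{1 \le \delta \le c-2 \\ 1 \le \tau \le c-\delta-1}} \Big\{ M_{\tau+\delta} - M_\tau - \frac{m_c}{N_c} (N_{\tau+\delta} - N_\tau) \Big\}. \tag{28}$$

Like $T_1$, the statistic $T_3$ is equivalent to a cusum statistic as understood by
Pettitt (1980), Worsley (1983), and Horváth (1989).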



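If we set

$$C_{\delta,\tau,u} = \Big\{ M_{\tau+\delta} - M_\tau - \frac{m_c}{N_c} (N_{\tau+\delta} - N_\tau) \ge u \Big\} = \Big\{ M_{\tau+\delta} - M_\tau \ge u + \frac{m_c}{N_c} (N_{\tau+\delta} - N_\tau) \Big\} \tag{29}$$

the first upper Bonferroni bound yields

$$P_{H_0}(T_3 \ge u \mid M_c = m_c) = P_{H_0}\Big( \bigcup_{\delta=1}^{c-2} \bigcup_{\tau=1}^{c-\delta-1} C_{\delta,\tau,u} \,\Big|\, M_c = m_c \Big) \le \sum_{\delta=1}^{c-2} \sum_{\tau=1}^{c-\delta-1} P_{H_0}(C_{\delta,\tau,u} \mid M_c = m_c). \tag{30}$$

With $y_{\delta,\tau} = \lceil u + \frac{m_c}{N_c} (N_{\tau+\delta} - N_\tau) \rceil$ we get in analogy to (11) and (12)

$$P_{H_0}(C_{\delta,\tau,u} \mid M_c = m_c) = \sum_{i=y_{\delta,\tau}}^{\min\{m_c,\, N_{\tau+\delta} - N_\tau\}} \binom{N_c}{m_c}^{-1} \sum_{\substack{x_{\tau+1} + \cdots + x_{\tau+\delta} = i \\ x_1 + \cdots + x_\tau + x_{\tau+\delta+1} + \cdots + x_c = m_c - i}} \ \prod_{j=1}^{c} \binom{n_j}{x_j}. \tag{31}$$

Again, these bounds are not easy to compute if $m_c$ is large. However, in
the Bernoulli situation with $n_1 = \cdots = n_c = 1$, $X_i \in \{0, 1\}$, $N_i = i$ for
$i = 1, \ldots, c$, we get

$$T_3 = \max_{\substack{1 \le \delta \le c-2 \\ 1 \le \tau \le c-\delta-1}} \Big\{ M_{\tau+\delta} - M_\tau - \frac{m_c}{c}\, \delta \Big\}, \quad y_{\delta,\tau} = \Big\lceil u + \frac{m_c}{c}\, \delta \Big\rceil, \tag{32}$$

$$P_{H_0}(C_{\delta,\tau,u} \mid M_c = m_c) = \sum_{i=y_{\delta,\tau}}^{\min\{m_c,\, \delta\}} \binom{c}{m_c}^{-1} \binom{\delta}{i} \binom{c-\delta}{m_c - i}. \tag{33}$$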



By setting $\tau + \delta = c$, i.e. by assuming that the second change-point is
located at $(c+1)$ or, in other words, that there exists only one change-point
for $X_1, \ldots, X_c$ and that this is located at $(\tau+1)$, the formal similarity
between the statistic $T_1$ in Section 2 and the statistic $T_3$ in Section 3.2 is
obvious.
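
When $\delta$ is unknown, the statistic $T_3$ requires a maximization over both the window width and the window position. A direct $O(c^2)$ evaluation from the partial sums is sketched below; the name `cusum_t3` and the input format are assumptions made for illustration and are not taken from the paper.

```python
from itertools import accumulate

def cusum_t3(x, n):
    """Return (T3, tau_hat, delta_hat) for counts x[i] out of n[i] trials.

    Hypothetical helper: maximizes
    M_{tau+delta} - M_tau - (m_c / N_c) * (N_{tau+delta} - N_tau)
    over delta = 1, ..., c-2 and tau = 1, ..., c-delta-1.
    """
    c = len(x)
    M = [0] + list(accumulate(x))  # M[j] holds M_j
    N = [0] + list(accumulate(n))  # N[j] holds N_j
    rate = M[c] / N[c]             # m_c / N_c
    best = None
    for delta in range(1, c - 1):
        for tau in range(1, c - delta):
            value = M[tau + delta] - M[tau] - rate * (N[tau + delta] - N[tau])
            if best is None or value > best[0]:
                best = (value, tau, delta)
    return best

t3, tau_hat, delta_hat = cusum_t3([0, 1, 1, 1, 0, 0], [1] * 6)
```

For this toy sequence the maximum is reached by the window covering observations 2 to 4, i.e. $\hat\tau = 1$ and $\hat\delta = 3$.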



4 Discussion 



In the literature, likelihood ratio tests and cusum tests are treated as
different approaches, which may be compared with respect to power (Pettitt
(1980), Worsley (1983)). Our results demonstrate that cusum type statistics
for testing for change-points in binomial variables may be derived by
considering an expansion of the log likelihood ratio statistic in a neighborhood
of a fixed but arbitrary point of the null hypothesis of homogeneity. This
means that this type of cusum test has approximately the same power as the
corresponding likelihood ratio test in a sufficiently small neighborhood of the
null hypothesis of no change-point. However, one should keep in mind that
Worsley (1983) and Horváth (1989) showed that the likelihood ratio test for
testing for one change-point in a sequence of independent binomial random
variables is much more powerful than the cusum test close to the ends of the
sequence.

The tests for one or two change-points in binomial variables may
be applied to epidemiological data similar to those considered by Worsley
(1983). There, for each of the $c = 17$ years from 1960 to 1976 the number
($x_i$) of cases of the birth deformity talipes in the first month of gestation was
given together with the total number ($n_i$) of all births that were in the first
month of gestation in the same year and region in northern New Zealand.
Worsley (1983) observed that both the likelihood ratio and the cusum statistic
yielded the estimate $\hat\tau = 7$, which means that there was a change of the
probability of occurrence of club foot from the year 1966 on. This is
interesting, as in this region in 1965 the herbicide 2,4,5-T was used for the
first time.

The tests for change-points in Bernoulli variables assume that at
most one incident may occur within a time unit. This implies that incidents
are rather rare or that the chosen unit of time is rather small, respectively.
Two examples for this kind of epidemiological data were given in Table 1
and Table 2 of Weinstock (1981). In Table 1 the dates of thirty-five cases
of oesophageal atresia or tracheo-oesophageal fistula in Birmingham within
$c = 2191$ days altogether are given, while in Table 2 the dates of sixty-three
corresponding incidents in the Newcastle region within $c = 3287$ days
altogether are reported. By a different approach (Krauth, 1998) we found a
change-point on day 1233 for the first data set and a change-point on day
1718 for the second data set. By the present approach we find one change-point
at $\tau = 1232$ with $T_1 = 12.681$ and $U = 0.003285$ ($U$ denotes the upper
bound) for the first data set, while $\tau = 1717$, $T_1 = 18.309$ and $U = 0.000460$
result for the second data set. If we look for two change-points with $\delta$
unknown we get $\tau = 1232$, $\delta = 942$, $T_3 = 12.952$, $U = 1$ for the first data set,
and $\tau = 1717$, $\delta = 1560$, $T_3 = 19.100$, $U = 0.692196$ for the second data set.
If we use these values of $\delta$ in the procedure with known $\delta$, though obviously
this is not allowed, we get $\tau = 1232$, $T_2 = 28.000$, $U = 0.009283$ for the first
data set and $\tau = 1717$, $T_2 = 49.000$, $U = 0.001153$ for the second data set.
Abstract: The problem of matching two images of the same objects after
movements or slight deformations arises in medical imaging, but also in the
microscopic analysis of physical or biological structures. We present a new
matching strategy consisting of two steps. We consider the grey level function
(modulo a normalization) as a probability density function. First, we apply a
density based clustering method in order to obtain a tree, or more generally a
hierarchy, which classifies the points on which the grey level function is
defined. Secondly, we use the identification of the hierarchical representations
of the two images to guide the image matching or to define a distance between
the images for object recognition. The transformation invariance properties of
the representations, which we will demonstrate, permit the extraction of
invariant image points. In addition, using the identification of the hierarchical
structures, they also permit finding the correspondence between invariant
points even if these have moved locally. Finally, we mention possibilities to
construct hierarchies which integrate more geometrical information. The
method's results on real images will be discussed.



1 Introduction 

Tasks like object and speech recognition, but also the integration of informa- 
tion from multiple imaging sources during a medical intervention are char- 
acteristic of the information age: information is treated, interpreted, and 
integrated. We will present a new definition for a distance between images 
for object recognition and a new strategy for image matching. Hereby, we 
will use a structural abstraction of images in terms of trees. A graph struc- 
ture is useful to represent information about topology and shape of and 
between extracted features in images and therefore to compare two images 
(Pavlidis (1968), Rastall (1969), Barrow and Popplestone (1971)), but also,
as we will show, to match images in which local movements appear (e.g., 
images of deformable objects). For this, we have to construct the graph in 
such a way that its structure is kept even if deformations appear or features 
have moved and to use tree identification to find the correspondence between 
these features. 

A grey level function $g: \mathbb{R}^n \to \mathbb{R}_{\ge 0}$, $n = 2$ or $n = 3$, associates to each
physical point in the image scene (represented by $\mathbb{R}^n$) a value corresponding
to its physical properties ($g \in \mathbb{N}$ in the technical realization) and will be
considered as given by the acquisition device. If necessary, it is smoothed and
normalized in order to have a continuous density function. The presented
approaches apply density based classification methods (Wishart (1969),
Hartigan (1975 and 1985), Silverman (1986)) to the grey level function
(considered as density function) in order to construct trees. Such a density based
tree representation will adapt itself to global and local movements in the
image, as detailed in section 2 (Figure 1). This permits us to find
corresponding points (associated to nodes or leaves in the tree), even if local
movements appear.

The method given for identification of the two trees corresponding to the 
two images tries to associate the leaves of the trees. It is based on the 
(intra-) tree distance (defined in section 4) which is robust with respect to 
instabilities in the trees. Finally, we get corresponding points (or subsets) 
having a series of applications as detailed in section 5. 

The main originality of the present paper lies in three points: 

• To apply hierarchical classification of the image points (the points on
which the grey level function is defined) for image matching and to
define a distance between two images based on their tree representation
for object recognition. Hereby, we can formulate the precise requirements
of the classification task and thus meet Cormack's claim ('why
and how classify?') on classification methods (Silverman (1986), p.
130).

• To found our matching strategy on a topological basis, expressed in
the invariance theorem of section 2.

• To propose a simple tree identification algorithm which respects local
instabilities of the tree at all levels. The tree identification problem
arises also in other domains (Shasha et al. (1994)).




Figure 1: Illustration of the density based tree representation. 

We remark that Leu and Huang (1988) used a comparison of ordered
trees to define a distance between objects for object recognition. However,
a tree identification method able to handle instabilities in the tree is not
proposed, and the algorithm's suitability to match images where non-affine
deformations appear is not investigated. Moreover, the tree construction
approach does not use statistical methods, but is of geometrical nature, based
on the closed¹ contour in a segmented image.

A recent general overview of matching algorithms can be found in Lavallee 
et al. (1997). 

The remainder of the paper is organized as follows. In section 2, we de- 
tail the construction of the confinement tree and its invariance property. 
The problems associated to this approach are formulated in section 3. In 
section 4, we show how to identify the hierarchical representations and we 
present some results on real images in section 5, before we suggest a more 
sophisticated classification approach and conclude in section 6. 



2 Construction and Invariance Properties of 
the Confinement Tree 

Confronted with the classification task based on a density function we were
led to a mathematical entity denoted confiner, appearing in a natural way
when looking for stochastic equivalents of attractors (Demongeot and Jacob
(1990)). Given a density function $g: \mathbb{R}^n \to \mathbb{R}_{\ge 0}$, $n \in \mathbb{N}$, the confiners
are defined as the maximal connected subsets (or components, Gaal (1964),
p. 105) $C_k$ of the level sets $\{x \in \mathbb{R}^n \mid g(x) \ge k\}$, $k \in \mathbb{R}_{\ge 0}$. In
the classification domain we found these components first in a contribution
by Wishart (1969) and further investigations by Hartigan (1975 and 1985),
where they are called high density clusters. Considering them taken on
several levels including the 0 level, they obviously define a tree as illustrated
in Figure 1. We call this tree the confinement (or density) tree, and if $g$ is a grey
level function ($n = 2, 3$) of an image $I_g$ we use the term confinement tree
representation of $I_g$. The intermediary nodes and the leaves of the tree are
the confiners taken at the respective levels. We have chosen this classification
method because of the invariance theorem and its consequences shown below.
Curiously, this property has already been mentioned in the contributions
of Wishart (1969) and Hartigan (1975 and 1985) (without assuming the
paradigm to hold and only in the case of an affine transformation) even if
they did not address a matching problem.

In practice, we take all grey levels of the image into account and we
delete confiners with a mass (which we calculate as the sum of all normalized
grey values associated to the pixels in the confiner) less than 1% of the
mass of all confiners at this level. We choose the discrete $d_4$-distance for
defining the connectivity between pixels. The time to calculate the tree is
$O(ln)$, where $n$ is the number of pixels and $l$ is the number of levels.
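
The construction can be sketched for a small 2D image as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the function names `level_components` and `confinement_tree` are hypothetical, pixels are connected through their four direct neighbours (an assumption), level sets are taken as $\{g \ge k\}$, and the deletion of low-mass confiners is omitted. Each level costs one pass over the pixels, which matches the stated order of $l \cdot n$ operations.

```python
from collections import deque

def level_components(grid, k):
    """Label the 4-connected components of the level set {g >= k}.

    Returns (labels, count); pixels outside the level set keep label -1.
    """
    rows, cols = len(grid), len(grid[0])
    labels = [[-1] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] >= k and labels[r][c] < 0:
                labels[r][c] = count
                queue = deque([(r, c)])
                while queue:  # breadth-first flood fill of one confiner
                    y, x = queue.popleft()
                    for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and grid[ny][nx] >= k and labels[ny][nx] < 0):
                            labels[ny][nx] = count
                            queue.append((ny, nx))
                count += 1
    return labels, count

def confinement_tree(grid, levels):
    """Return a dict mapping each node (level, component) to its parent node.

    A confiner at level k is attached to the confiner of the previous
    (lower) level that contains it; nodes at the lowest level have parent None.
    """
    parent = {}
    prev_labels, prev_level = None, None
    for k in sorted(levels):
        labels, count = level_components(grid, k)
        if prev_labels is None:
            for lab in range(count):
                parent[(k, lab)] = None
        else:
            for r, row in enumerate(labels):
                for c, lab in enumerate(row):
                    if lab >= 0:
                        parent[(k, lab)] = (prev_level, prev_labels[r][c])
        prev_labels, prev_level = labels, k
    return parent

# Two bright pixels separated by a darker background
tree = confinement_tree([[3, 1, 3], [1, 1, 1]], range(4))
```

In the example, the single confiner at levels 0 and 1 splits into two confiners at level 2, so the tree has one bifurcation.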



¹ Simple edge detection (Canny 1986) is not sufficient.
The following strong but realistic hypothesis is the foundation of the
transformation invariance property of our method.

Medical Imaging Paradigm Let $\phi: \mathbb{R}^n \to \mathbb{R}^n$, $n = 2$ or 3, be the
transformation expressing movements of physical points in an image scene
(rendered in the images A and B) which appear when passing from A to B,
$g_A$ the density associated to image A, $g_B$ that associated to image B, and
$X \in \mathbb{R}^n$. Then we can assume $g_A(X) = g_B(\phi(X))$ (apart from local noise
effects).

Let us remark that: (1) for the paradigm to hold, we should consider images
covering the whole space in which movements appear (thus, 3D images in
general), and the paradigm is a realistic assumption in a larger context
than medical imaging, but not in general; (2) if for two images A
and B the detector (or detector location) has changed we have to assume
$g_A(X) = \alpha\, g_B(\phi(X)) + \beta$, where $\alpha$ and $\beta$ can be calculated a priori.

Invariance Theorem Let A, B, $g_A$, $g_B$, and $\phi$ be defined as in the paradigm.
Then: if $\phi$ is a topological transformation, then, apart from noise effects, the
confinement trees of images A and B are identical; moreover, the confiners
of A are transformed into the confiners of B and reciprocally.

Proof According to the definition (Gaal (1964), pp. 186, 187) a topological
transformation is bijective and bicontinuous (i.e., $\phi$ and $\phi^{-1}$ are both
continuous). Let us denote $L_k^A = \{X \mid g_A(X) \ge k\}$ and $L_k^B = \{Y \mid g_B(Y) \ge k\}$.
Using the paradigm we have
$L_k^B = \{Y \mid g_B(Y) \ge k\} = \{Y \mid g_B(\phi(\phi^{-1}(Y))) \ge k\} = \{Y \mid g_A(\phi^{-1}(Y)) \ge k\} = \{\phi(X) \mid g_A(X) \ge k\} = \{\phi(X) \mid X \in L_k^A\} = \phi(L_k^A)$.
As $\phi$ is bijective we have reciprocally $L_k^A = \phi^{-1}(L_k^B)$. If $C_k^A$ is a
component of $L_k^A$ (i.e., a confiner) then, as $\phi$ is topological (Gaal (1964),
Lemma IV.8.1), $\phi(C_k^A)$ is a component of $\phi(L_k^A)$ and, as $\phi(L_k^A) = L_k^B$, $\phi(C_k^A)$
is a confiner of B. As the reverse is true too, $\phi$ induces a bijection between
the nodes of the two trees. Finally, if node $C_k^A$ is father of node $C_{k'}^A$ we have

$$C_{k'}^A \subset C_k^A \ \Rightarrow\ \phi(C_{k'}^A) \subset \phi(C_k^A) \ \Rightarrow\ \text{node } \phi(C_k^A) \text{ is father of node } \phi(C_{k'}^A).$$

The importance of the property lies in the fact that the tree is invariant
with respect to local movements. The effect of noise and other problems will be
detailed in the next section.



3 Three Problems 

The first problem occurs if more than two confiners merge at approximately
the same level (see Figure 2(a)). If, due to noise,
the order of fusion changes, we will have an exchange between the confiner
subtrees. An approach to deal with this is to use the (intra-) tree distance
(see section 4) during tree identification. We call this problem the problem
of simultaneous fusion.

In echographical images we observed that noise is often locally of little
variance but of high intensity. Therefore, the confiners of a thin but long region
(in the image) of high density can be cut at multiple levels due to noise. This
can cause an exchange between distant subtrees (Figure 2(b) illustrates this).
This could be considered as the analogue of the chaining effect arising in the
context of classification of a point distribution (Wishart (1969)). In fact, in
the classification context a chaining effect appears when thin noise traces
connect distant clusters (which can arise in imaging, too). We call this problem
the problem of chaining effects.

The third problem addresses the loss of geometrical information. Geometri- 
cal information is hidden in clusters if the density function is not piecewise 
normal. This problem arises also if we try to match extremal points (Thirion 
(1996)) in cases where crest lines (Monga et al. (1992)) would better repre- 
sent the information in the image. This is illustrated in Figure 2(c). We call 
this problem the problem of hidden geometrical information. 




Figure 2: (a) Simultaneous fusion, (b) Chaining effect, (c) Loss of geomet- 
rical information. 



4 Matching Algorithm and Image Distance 

Here we present an algorithm that we have designed to identify the two
resulting trees. Its interest lies in its simplicity, in its efficiency in
computation time, and in its robustness with respect to small differences between
the two trees due to noise. Other tree comparison algorithms are usually
application specific; for instance, Farris (1969) compares taxonomic trees and
Lu (1979) compares ordered trees. Shasha et al. (1994) proposed exact and
approximate algorithms which derive one tree from the other by a sequence
of several elementary transformations. Associating a cost value to the
elementary transformations, they look for the sequence with minimum cost.

Let us denote the two trees to compare by A and B. Our distance definition
is based on the (intra-) tree distance $d_T$ associated to a tree T (T = A or
T = B), which is defined for each pair of nodes $(N_1, N_2)$ by the sum of the
weights (for the definition of the weights see section 5) associated to the edges
in the path in the tree which connects the nodes $(N_1, N_2)$ (Barthelemy and
Guenoche (1988)). Let us enumerate the leaves of A (resp. B) by $(A_i^l)_{1 \le i \le M}$
(resp. $(B_i^l)_{1 \le i \le M'}$) ($i \in \mathbb{N}$; $M$ (resp. $M'$) denotes the number of leaves in A
(resp. B)) in an arbitrary order. We will associate leaves of A (resp. B) with
nodes of B (resp. A) to obtain associated pairs $(A_i^l, B_i)$ (resp. $(B_i^l, A_i)$),
($i \in \mathbb{N}$, $B_i$ (resp. $A_i$) node of B (resp. A)) and the sequence $(A_i^l, B_i)_{1 \le i \le M}$
(resp. $(B_i^l, A_i)_{1 \le i \le M'}$). We define the inter-tree distance between A and B
by minimizing over all such sequences as

$$\begin{aligned} d(A,B) = \min \Big( &\sum_{0 < i < j \le M} \big| d_B(B_i, B_j) - d_A(A_i^l, A_j^l) \big| && (1) \\ {}+{} &\sum_{0 < i < j \le M'} \big| d_B(B_i^l, B_j^l) - d_A(A_i, A_j) \big| \Big). && (2) \end{aligned}$$

Figure 3 illustrates the robustness of the use of our inter-tree distance
$d(A,B)$.
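
The intra-tree distance and one of the two sums of the inter-tree distance can be illustrated for a fixed association of leaves to nodes. The sketch below is not the authors' code: the tree encoding (a dictionary mapping each node to its parent and the weight of the connecting edge), the names `tree_distance` and `association_cost`, and the example trees are assumptions, and the minimization over all associations is not performed.

```python
from itertools import combinations

def tree_distance(u, v, parent):
    """Sum of the edge weights on the path between nodes u and v.

    `parent` maps a node to (parent node, edge weight), or to None for the root.
    """
    dist_from_u = {u: 0}  # cumulative weight from u to each of its ancestors
    node, d = u, 0
    while parent[node] is not None:
        node, w = parent[node]
        d += w
        dist_from_u[node] = d
    node, d = v, 0
    while node not in dist_from_u:  # climb from v to the first common ancestor
        node, w = parent[node]
        d += w
    return d + dist_from_u[node]

def association_cost(pairs, parent_a, parent_b):
    """Sum over i < j of |d_B(B_i, B_j) - d_A(A_i, A_j)| for fixed pairs.

    `pairs` is a list of (leaf of A, associated node of B).
    """
    return sum(
        abs(tree_distance(b1, b2, parent_b) - tree_distance(a1, a2, parent_a))
        for (a1, b1), (a2, b2) in combinations(pairs, 2)
    )

# Small weighted tree: root r with children a (weight 2) and b (weight 3);
# a has the leaves c (weight 1) and d (weight 4).
tree = {"r": None, "a": ("r", 2), "b": ("r", 3), "c": ("a", 1), "d": ("a", 4)}
```

Associating every leaf of a tree with itself gives cost 0, while a mismatched association such as `[("c", "c"), ("d", "b")]` is penalized by the difference of the corresponding path lengths.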

The central theme behind our algorithm can be seen as the association of
the leaves of A to the nodes of B such that the minimum in (1) is achieved,
and reciprocally for (2). With $subtree(N)$, where $N$ is a node, we denote
the subtree rooted in $N$. We will also use the notations $father(N)$, and
$r(T) = root(T)$ in the evident sense, $child(N)$ for the set of all children of
node $N$, and $mass(N)$ as the sum over all weights in $subtree(N)$.

In the following algorithm we try to associate each leaf $A_i^l$ (or a certain number
of selected leaves, see section 5) of A to a node $B_k$ of B by moving the
leaf $A_i^l$ downwards in the tree B.

Initialization:

We initialize the leaves of A on the root of B.

Bifurcation:

If $A_i^l$ is associated to node $B_{k_0}$ we are looking for the child of $B_{k_0}$ which
minimizes the best fit value, which we define now:

$A_{k_0}$ := root of the subtree containing $A_i^l$ and minimizing $|mass(B_{k_0}) - mass(A_k)|$,
$k \in \mathbb{N}$;

$A_{j_0}^l := A_i^l$;

$A_{j_1}^l$ := leaf in $subtree(A_{k_0})$ with largest distance from $A_{j_0}^l$;

$A_{j_2}^l$ := leaf in $subtree(A_{k_0})$ which maximizes the sum of its distances to $A_{j_0}^l$
and $A_{j_1}^l$.

For each $B_{k_1} \in child(B_{k_0})$ do:

$bestfit(B_{k_1})$ := minimum of the sum below while minimizing over all leaves $B_{i_0}^l$
in $subtree(B_{k_1})$ and all leaves $B_{i_1}^l, B_{i_2}^l$ in B:

$$\sum_{0 \le p < q \le 2} \big| d_A(A_{j_p}^l, A_{j_q}^l) - d_B(B_{i_p}^l, B_{i_q}^l) \big| + \sum_{p=0}^{2} \big| d_A(A_{j_p}^l, r(A)) - d_B(B_{i_p}^l, r(B)) \big|. \tag{3}$$

Return the child $B_{k_1} \in child(B_{k_0})$ realizing the minimum of the best fit among all
those children.

If $subtree(B_{k_0})$ or $subtree(A_{k_0})$ have just two leaves, we use a geometrical
criterion that is easy to define (based on the centre of gravity of the concerned
confiners) to decide the final association of $A_i^l$.

We remark that, as we are minimizing over all leaves $B_{i_1}^l, B_{i_2}^l$ in the whole tree
B to define $bestfit$, it does not matter if $subtree(A_{k_0})$ does not correspond
exactly to $subtree(B_{k_0})$. The computation time is $O(M (M')^3)$. In practice,
we have approximately $M' = 50$, $M = 10$.





Tree A ( total weight = 44, weight constant = 1 ) Tree B ( total weight = 44, weight constant = 1 ) 

Figure 3: Example for weighted trees A and B (each edge is weighted by 
a certain number of Es) with a subtree exchange and illustration of the 
robustness of the intra-tree distance: for instance |dA(l,5) — dB{a^h)\ = 
|16 — 17| < \dA{l, 5) — dB{a, e)| == 4. If A^ = A^^ = 5 and B^^ = r{B), we 
have Ak^ = — 2, ~ ^ bestfit{Bc 2 ) — 3 for the leaves 

h, 6, i and bestfit{Bcf) = 7 for the leaves d, /, b. 



5 Applications 

Applications of our method include the study of the relative movement of
grains in microscopic metallurgical images or of cells in microscopic
biological images. On the other hand, we can also employ our method to detect
asymmetries in images, or to detect where in an image movements occur. In
the first case this serves to detect a tumor and its location. In the second
case the method helps us to apply watershed, optical flow, or active contour
techniques at the place found. If we just want to recover the affine
transformation between two successive 2D slices of a 3D image, we apply a least
squares fitting of the Euclidean distance between the corresponding invariant
characteristic points found by our identification algorithm. An alternative
to this is to proceed as Thirion (1996) for matching extremal points and just
to define the characteristic points with the help of the trees.

A further interesting application of the proposed method is the comparison
of two brain images resulting from functional MRI. Here small local
deformations as well as global shifts occur. The aim is to locate positions where
the brain activation (and therefore the intensity of the functional signal)
changed.

Figure 4 shows the tree representations of the images and the selected
nodes of the trees. The selected nodes are the nodes in the middle of the
subbranches presenting the largest number of tree levels between two
bifurcations and having no subbranch with the same property as successor.
Figure 5 shows the new position in the lower tree of Figure 4 of the selected
nodes of the upper tree after the application of our matching algorithm. We
get the best results (Figure 5) when we weight each edge of the tree just
by a constant (not dependent on grey intensity) such that both trees have the
same total weight. This results from the fact that the changes in grey
intensity are larger than those concerning the topological structure, especially
in the upper right part of the brain, where a tumor is located. However, for
detecting the tumor or its location after matching, weights constructed from
the intensity values contain useful information.




Figure 4: Top left: Initial image of the upper slice of the brain image; middle:
corresponding density tree; top right: highlighted clusters as marked by the
arrows in the tree; bottom: the same as above for the lower slice of the brain
image.

Figure 5: Initial image of the lower slice of the brain image on the left,
expected clusters found after tree identification marked by arrows in the
corresponding density tree in the middle, and highlighted clusters as marked by
the arrows in the tree on the right. Outlier 2 can be removed with the
geometrical criterion based on the centres of gravity of the concerned confiners.



6 Conclusion 

In order to extract more geometrical knowledge we have to segment the
confiners. The segmentation should be invariant with respect to
transformations and robust to noise. The key to control this is, in our opinion, to
take into account simultaneously the geometrical (horizontal) aspect and the
mass (vertical) aspect. This leads us, for instance, to convex density
contours (Sager (1979)) or to the excess mass ellipsoid (Nolan (1991)), on the
basis of which we can construct a hierarchy.

We have shown a new way to apply classification to image matching and ob- 
ject recognition. We hope researchers normally confronted with density and 
mode estimation and clustering are interested and motivated to investigate 
some of the mentioned problems. 
Abstract: The well-known 'k-means' clustering can be regarded as an
approximation of a given distribution (which can be a sample) by a set of k
optimally chosen points. However, in many cases approximative sets of different
types are of interest. For example, approximation of a distribution by circles is
important in allocating communication stations, the circles being interpreted as
working areas of the stations. The paper covers two related topics. First we
propose a heuristic algorithm to find k circles of a given radius r that fit the
planar data set. Then we analyse the problem of consistency: does a sequence of
sample-based sets of optimal circles converge to the class of optimal circles for
the population? The positive answer is given for arbitrary finite-dimensional
normed linear spaces.



1 Introduction 

At the beginning of the century K. Pearson introduced the notion of principal
components, which define 'lines and planes of closest fit to systems of
points in p-dimensional space'. Since the 1950s, k-means clustering has
received considerable attention from many authors (see, e.g., MacQueen (1967),
Bock (1974), Späth (1975), Pollard (1981), Lloyd (1982), Cuesta and Matran
(1988)). Instead of k-means, B. Flury (1993) prefers the term principal
points. More recently, principal curves and principal surfaces have become
important tools in data analysis (see, e.g., Hastie and Stuetzle (1989)).
Computational problems related to the fitting by curves of a specific form
(circles and ellipses) are treated in Späth (1997). In all the above cases, there
is a common idea that a set of a certain type has to be found that fits (i.e.
approximates) the data in a given sense. We call this problem 'approximation
of distributions by sets'.

In this paper, approximation by circles of given radius is considered. Just 
to give another example, chains with given length of links can be useful in 
some cases. 

The general approximation problem can mathematically be stated as follows.
Let $(T, d)$ be a metric space and $X \sim P$ a random element in it. Let $\mathcal{M}$
be a class of subsets of $T$. For any $M \in \mathcal{M}$ we define the distance from
the random element $X$ to the set $M$ as the distance from $X$ to its closest
point in $M$, i.e. $d(M, X) = \inf\{d(m, X) : m \in M\}$. Let $\phi : \mathbb{R}^+ \to \mathbb{R}^+$ be
a nondecreasing function satisfying $\phi(0) = 0$ (often $\phi(d) = d^2$ is used). The
general approximation problem can now be stated as follows: minimize the
loss-function

$$W(M, P) := E\,\phi[d(M, X)] \to \min. \qquad (1)$$
In the next section we consider a special case where the space $T = \mathbb{R}^2$ and
the class $\mathcal{M}$ consists of all possible unions of k circles of a given radius r.
The distribution P is represented by a sample $\mathcal{X}$ of n points.



2 Optimal Circles for Planar Data 



Let us have a sample $\mathcal{X} = \{x_1, \dots, x_n\} \subset \mathbb{R}^2$ from a distribution P. Let $P_n$
be the distribution that assigns equal probability 1/n to each point in $\mathcal{X}$.
The task is to find k circles of radius r (k and r are given numbers) that fit
the set $\mathcal{X}$ as well as possible. More specifically, the term 'fit' is interpreted
as the covering property of the k circles and, for the data points that are not
covered, their closeness to the k circumferences. Therefore, difficulties
can arise only when k and r are relatively small; this is the situation we
focus on.

The work is motivated by the problem of finding good allocation for k com- 
munication stations of a given working radius r. It is reasonable to minimize 
the summed squared distance from the customers Xi living outside k working 
areas to their individually closest working areas. We can formulate the idea 
in the framework of the problem (1) working with the circle centres rather 
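than the circles themselves. Let $A = \{a_1, \dots, a_k\}$ be a set of k circle centres
and let $\|\cdot\|$ denote the Euclidean norm. Then our task is to minimize

$$W(A, P_n) := \frac{1}{n} \sum_{i=1}^{n} \phi\Big(\min_{1 \le j \le k} \|x_i - a_j\|\Big) \to \min_{A : |A| = k}, \qquad (2)$$

where the function $\phi$ is defined as

$$\phi(x) = \begin{cases} 0, & \text{if } 0 \le x \le r, \\ (x - r)^2, & \text{if } x > r. \end{cases} \qquad (3)$$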
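The empirical loss can be evaluated directly from its description. The following is a minimal Python sketch, assuming that the loss of a point is zero inside its nearest circle and the squared distance to the circumference outside it, as the allocation example above suggests. The names `phi` and `empirical_loss` are illustrative and not taken from the paper.

```python
import math

def phi(x, r):
    """Loss of a point at distance x from its nearest centre:
    zero inside the circle, squared distance to the circumference outside."""
    return 0.0 if x <= r else (x - r) ** 2

def empirical_loss(points, centres, r):
    """Average of phi over the sample, each point using its nearest centre."""
    total = 0.0
    for p in points:
        nearest = min(math.dist(p, a) for a in centres)
        total += phi(nearest, r)
    return total / len(points)

# Three customers on a line, one station at the origin with working radius 1.
sample = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
print(empirical_loss(sample, [(0.0, 0.0)], r=1.0))
```

With r = 0 the same function returns the usual k-means criterion (mean squared distance to the nearest centre).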



Next we propose a two-stage iterative algorithm to find ’optimal’ centres or, 
equivalently, optimal circles. In the case of r = 0 the algorithm reduces to 
the well-known Lloyd's method for computing 'k-means' (Lloyd (1982)).
The algorithm consists of the following principal steps: 

Step 1 Specify initial values for the centres of k circles by choosing arbitrary 
k points on the plane, or, alternatively, by choosing k random points 
from the data set X. 

Step 2 Find the corresponding Voronoi partition, i.e. classify each point of 
X to its nearest centre in the current set of centres (such a partition 
is also called minimal distance partition) . 

Step 3 Find an optimal approximating circle for each region of the Voronoi 
partition obtained by Step 2 (this is done by a heuristic procedure 
which is described below). 
Step 4 (Small loop) Repeat Steps 2 and 3 until there is no significant
improvement in the locations of centres, or the number of iterations
exceeds a prescribed value.

Step 5 (Big loop) Repeat Steps 1-4 starting with different initial values, 
each time resulting in a locally optimal solution. Then choose the best 
among them. 

Step 3 is crucial and needs some additional explanation. Let $a_1^{(s)}, \dots, a_k^{(s)}$
be the current set of k circle centres (set s = 0 on Step 1), each being the
centre of a circle of radius r. Let $\{S_1^{(s)}, \dots, S_k^{(s)}\}$ be the Voronoi
partition of the set $\mathcal{X}$, generated by the points $a_1^{(s)}, \dots, a_k^{(s)}$. Finally, denote
the number of points in $S_j^{(s)}$ by $n_j^{(s)}$ and their arithmetic mean by

$$\bar{x}_j^{(s)} = \frac{1}{n_j^{(s)}} \sum_{x_i \in S_j^{(s)}} x_i .$$

Note that, in the case of r = 0, Step 3 reduces to the calculation of the arithmetic
means $\bar{x}_j^{(s)}$, j = 1, ..., k. In the general case when r > 0, Step 3 divides
into substeps 3.1 - 3.2.

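Step 3.1 Find improved circle centres by the formula

$$a_j^{(s+1)} = \begin{cases} a_j^{(s)}, & \text{if } n_j^{(s)} = 0, \\ \bar{x}_j^{(s)}, & \text{if } n_j^{(s)} = 1 \text{ or } 2, \\ \text{go to Step 3.2}, & \text{if } n_j^{(s)} \ge 3. \end{cases}$$

Step 3.2 (Apply only when $n_j^{(s)} \ge 3$).

Step 3.2.1 Find the circle of smallest radius which covers all points
of the region $S_j^{(s)}$. Let $c_j^{(s)}$ and $\rho_j^{(s)}$ be the centre and the radius
of the smallest covering circle. Set

$$a_j^{(s+1)} = \begin{cases} c_j^{(s)}, & \text{if } \rho_j^{(s)} \le r \text{ (full cover)}, \\ \text{go to Step 3.2.2}, & \text{otherwise}. \end{cases}$$

Step 3.2.2 Calculate the values of the loss-function at N + 1 equally
spaced points between $\bar{x}_j^{(s)}$ and $c_j^{(s)}$, i.e. at the points
$z_i = \lambda_i \bar{x}_j^{(s)} + (1 - \lambda_i) c_j^{(s)}$ with $\lambda_i = i/N$, i = 0, 1, ..., N.
Take the best point for the new centre $a_j^{(s+1)}$.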
The idea of Step 3.2.2 is that the two-dimensional search for a good
centre is replaced by a one-dimensional search carried out on a 'strategic'
interval. The choice of the parameter N mainly depends on the required
precision and computational resources.
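The centre update of Step 3 for a single Voronoi region can be sketched in Python as follows. This is only an illustration under stated assumptions: the handling of regions with 0, 1 or 2 points (keep the old centre, respectively use the mean) is assumed, the smallest covering circle is found by a brute-force search over pairs and triples rather than by whatever method the authors used, and all function names are hypothetical.

```python
import itertools
import math

def phi(x, r):
    return 0.0 if x <= r else (x - r) ** 2

def region_loss(region, z, r):
    """Loss of one region when its circle is centred at z."""
    return sum(phi(math.dist(p, z), r) for p in region)

def circumcircle(a, b, c):
    """Circle through three points, or None if they are collinear."""
    d = 2.0 * (a[0] * (b[1] - c[1]) + b[0] * (c[1] - a[1]) + c[0] * (a[1] - b[1]))
    if abs(d) < 1e-12:
        return None
    sa, sb, sc = (p[0] ** 2 + p[1] ** 2 for p in (a, b, c))
    ux = (sa * (b[1] - c[1]) + sb * (c[1] - a[1]) + sc * (a[1] - b[1])) / d
    uy = (sa * (c[0] - b[0]) + sb * (a[0] - c[0]) + sc * (b[0] - a[0])) / d
    return math.dist((ux, uy), a), (ux, uy)

def smallest_covering_circle(region):
    """Brute force: the optimum is fixed by two points (a diameter) or three."""
    candidates = []
    for a, b in itertools.combinations(region, 2):
        candidates.append((math.dist(a, b) / 2, ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)))
    for a, b, c in itertools.combinations(region, 3):
        circle = circumcircle(a, b, c)
        if circle is not None:
            candidates.append(circle)
    covering = [(rho, ctr) for rho, ctr in candidates
                if all(math.dist(p, ctr) <= rho + 1e-9 for p in region)]
    rho, centre = min(covering)
    return centre, rho

def update_centre(region, old_centre, r, N=20):
    n = len(region)
    if n == 0:
        return old_centre                      # assumed: keep the old centre
    mean = (sum(p[0] for p in region) / n, sum(p[1] for p in region) / n)
    if n <= 2:
        return mean                            # assumed: use the mean
    centre, rho = smallest_covering_circle(region)
    if rho <= r:
        return centre                          # full cover
    # One-dimensional search on the segment between the mean and the
    # centre of the smallest covering circle.
    grid = []
    for i in range(N + 1):
        lam = i / N
        z = (lam * mean[0] + (1 - lam) * centre[0],
             lam * mean[1] + (1 - lam) * centre[1])
        grid.append((region_loss(region, z, r), z))
    return min(grid)[1]

region = [(0.0, 0.0), (2.0, 0.0), (1.0, 1.0)]
print(update_centre(region, (5.0, 5.0), r=1.0))
```

For r = 0 the search returns the arithmetic mean, in line with the reduction to Lloyd's method mentioned in the text. The brute-force covering circle is only practical for small regions.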

Remark We have studied the algorithm with respect to some important 
aspects (like stability and correctness), using the Monte-Carlo method. The 
stability of the algorithm was investigated by a series of tests which
demonstrated that small changes in the data result in small changes of the best
(attained) value of the loss-function W. Still, one cannot expect small changes in
the optimal circle centres at the same time, because the loss-function can be very
flat. As to the correctness of the algorithm, we tested it in the case of r = 0
using a large sample $\mathcal{X}$ from a two-dimensional normal distribution. As
expected, the optimal 2-centres for large samples practically coincided
with those of the parent population. (The latter are well-known: optimal
k-centres for the two-dimensional normal distribution have been calculated
by several authors, see, e.g., Bock (1998), Kipper and Pärna (1992), etc.).
In statistical terms, the results of our Monte-Carlo experiment with zero 
radius and large samples from the normal distribution also suggested possible 
consistency of empirically optimal circles, i.e. convergence of estimated 
circle centres to their corresponding population values. As consistency is one 
of the most desired properties of any statistical estimator, we next present 
some results concerning strong consistency of empirically optimal circles. 



3 Strong Consistency of Optimal Circles 

Obviously, the convergence of circles of given radius is equivalent to the 
convergence of their centres. In the following it will be shown that, under 
fairly general conditions, any sequence of empirically optimal circle centres
converges, as the sample size tends to infinity, to the class of optimal circle
centres of the population. The 'convergence to the class' means (literally)
that, for any set $A_n$ of empirically optimal k circles, there exists a set of k
circles which is optimal with respect to the population and close to $A_n$,
provided that the sample size n is large. We study the problem in an arbitrary
finite-dimensional normed linear space (instead of just $\mathbb{R}^2$).

To prove the consistency of k-centres, some ideas from Pollard (1981) can be
used, in principle. Still, we follow the method utilised in Cuesta and Matran
(1988), which relies on the Skorohod Representation Theorem. We also make
use of the compactness of closed balls in finite-dimensional spaces. However,
some of the results can be generalized to infinite-dimensional spaces as well.
Note that we do not assume the uniqueness of the k-centre, in contrast to the
papers mentioned. This extends the applicability of our results, for example,
to various symmetric distributions, in which case we have, as a rule, many
optimal solutions.
3.1 Formulation of the Problem

Let $(\Omega, \mathcal{F}, P)$ be a complete probability space, E a finite-dimensional normed
linear space (nls) and $\{X_n\}$ a sequence of i.i.d. E-valued random vectors
with a common distribution P. Denote via $\{P_n^\omega\}$ the corresponding sequence
of empirical distributions, i.e. $P_n^\omega$ assigns equal probability 1/n to each
element of the random sample $\{X_1(\omega), \dots, X_n(\omega)\}$. Let $\mathcal{E}_k = \{A \subset E : |A| \le k\}$.

Considering the function $\phi$, defined by (3), we assume that $\int \phi(\|x\|)\,dP < \infty$
or, equivalently, $\int \|x\|^2\,dP < \infty$. In terms of the parent distribution P the
minimisation problem (2) reads

$$W(A, P) := \int \phi\big[\min_{a \in A} \|x - a\|\big]\,dP \to \min. \qquad (4)$$

Let us denote the optimum value of the loss-function by $W_k(P) = W_k(X) :=
\inf\{W(A, P) \mid A \in \mathcal{E}_k\}$ and the class of all optimal sets of k points by $\mathcal{U}_k(P) =
\mathcal{U}_k(X) := \{A \in \mathcal{E}_k \mid W(A, P) = W_k(P)\}$.

Definition 1 Each $A \in \mathcal{U}_k(P)$ is called a 'k-centre' of the measure P (of
the random element X).

We now specify the mode of convergence for k-centres.

Let $h(A, B)$ be the Hausdorff distance between two subsets $A, B \subset E$. The
convergence $h(A_n, A) \to 0$ in the metric space $(\mathcal{E}_k, h)$ will be denoted via
$A_n \to A$.

Definition 2 The Hausdorff distance between $A \subset E$ and a class $\mathcal{U}$ of
subsets of E is defined by

$$h(A, \mathcal{U}) = \inf\{h(A, B) : B \in \mathcal{U}\}. \qquad (5)$$
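For finite sets of centres both distances are easy to compute. The sketch below assumes the Euclidean norm and a finite class of candidate sets, so that the infimum in Definition 2 becomes a minimum. The function names are illustrative.

```python
import math

def hausdorff(A, B):
    """Hausdorff distance between two finite, non-empty point sets."""
    a_to_b = max(min(math.dist(a, b) for b in B) for a in A)
    b_to_a = max(min(math.dist(a, b) for a in A) for b in B)
    return max(a_to_b, b_to_a)

def hausdorff_to_class(A, U):
    """Distance from the set A to a finite class U of point sets."""
    return min(hausdorff(A, B) for B in U)

# Two optimal solutions of a symmetric problem and one empirical solution.
U = [[(-1.0, 0.0), (1.0, 0.0)], [(0.0, -1.0), (0.0, 1.0)]]
A_n = [(-1.1, 0.0), (0.9, 0.0)]
print(hausdorff_to_class(A_n, U))
```

The example mirrors the situation described in the text: when several optimal sets exist, an empirical solution only needs to be close to one of them.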

We are interested in the following problem: does an arbitrary sequence of
empirical k-centres $A_n^\omega \in \mathcal{U}_k(P_n^\omega)$ converge (for P-a.e. $\omega$) to the class $\mathcal{U}_k(P)$
of theoretical k-centres in the Hausdorff metric? In what follows the problem
will be restated in terms of a sequence of almost surely converging random
variables, instead of the measure-sequence $\{P_n^\omega\}$.

First note that, by Kolmogorov's SLLN, for each fixed $A \in \mathcal{E}_k$, we have
$W(A, P_n^\omega) \to W(A, P)$ a.s. From this it is not difficult to obtain the relation

$$\liminf_n W_k(P_n^\omega) \le W_k(P), \quad \text{a.s.} \qquad (6)$$

Indeed, take $\varepsilon > 0$ and let $A \in \mathcal{E}_k$ be an $\varepsilon$-optimal set for P. Since
$W_k(P_n^\omega) \le W(A, P_n^\omega) \to W(A, P) < W_k(P) + \varepsilon$ almost surely, we have
$\liminf_n W_k(P_n^\omega) \le W_k(P) + \varepsilon$, a.s. As $\varepsilon$ is arbitrary, (6) follows. Let the set
where the inequality (6) holds be $\Omega_1$.

Lemma 3 There exists a set $\Omega_0 \in \mathcal{F}$ with $P(\Omega_0) = 1$ such that for each
$\omega \in \Omega_0$ there exists a sequence $Y_n^\omega$, n = 0, 1, ..., of E-valued random
elements defined on the probability space $([0, 1], \mathcal{B}, \mathrm{Leb})$ and satisfying

1) $Y_0^\omega$ has the distribution P and $Y_n^\omega$ has the distribution $P_n^\omega$, n = 1, 2, ...,

2) $Y_n^\omega \to Y_0^\omega$ Leb-a.s.,

3) $\liminf_n W_k(Y_n^\omega) \le W_k(Y_0^\omega)$.

Proof: Let the symbol $\Rightarrow$ denote weak convergence of probability measures.
Due to Varadarajan (1958), we have that $P\{\omega \mid P_n^\omega \Rightarrow P\} = 1$, i.e. the sequence
of empirical measures converges weakly to the parent measure with probability
one. Let the corresponding set be $\Omega_2$. By the Skorohod Representation
Theorem, for every $\omega \in \Omega_2$, there exists a sequence $\{Y_n^\omega\}$, n = 0, 1, ...,
satisfying 1) and 2). Now it remains to take $\Omega_0 = \Omega_1 \cap \Omega_2$. □

Since the value of the loss function $W(A, \cdot)$ depends on the distribution only,
we have, by 1), that $\mathcal{U}_k(Y_n^\omega) = \mathcal{U}_k(P_n^\omega)$, for n = 1, 2, ..., and $\mathcal{U}_k(Y_0^\omega) =
\mathcal{U}_k(P)$, $\forall \omega \in \Omega_0$. Therefore, instead of minimising W w.r.t. the measures
$P_n^\omega \Rightarrow P$, we can minimize W w.r.t. random variables $Y_n^\omega \to Y_0^\omega$
a.s. For further convenience we shall denote the latter sequence (again) by
$X, X_n$, n = 1, 2, ..., and instead of $([0, 1], \mathcal{B}, \mathrm{Leb})$ we consider the general
space $(\Omega, \mathcal{F}, P)$. Now the initial problem can be restated in the following
way. Let $(\Omega, \mathcal{F}, P)$ be a probability space and $X, X_n \in L_2(\Omega, E)$, n =
1, 2, ..., be a sequence of E-valued random variables satisfying both $X_n \to X$
a.s. and $\liminf_n W_k(X_n) \le W_k(X)$. Given that, is it true that

$$\sup_{A_n \in \mathcal{U}_k(X_n)} h(A_n, \mathcal{U}_k(X)) \to 0, \quad \text{if } n \to \infty\,? \qquad (7)$$



As a first step, we show that k-centres from the classes $\mathcal{U}_k(X)$ and $\mathcal{U}_k(X_n)$,
n = 1, 2, ..., are uniformly bounded.

3.2 Preliminary Results 

For any closed subset B, let $\Pi_B(X)$ be an element of B which is closest to
X (a projection).

Proposition 4 If $X \in L_2(\Omega, E)$, then there exists a sequence of finite sets
$\{C_n\}$ such that $W(C_n, X) \to 0$.

Proof: Since $X \in L_2(\Omega, E)$, there exists a sequence of simple functions $\{Z_n\}$
such that $\|Z_n - X\|_2 \to 0$. Using the definition of $\phi$ we can write that
$\int \phi(\|Z_n - X\|)\,dP \le \int \|Z_n - X\|^2\,dP \to 0$. Let $C_n$ be the set of the values
of $Z_n$. Then $W(C_n, X) = \int \phi(\|X - \Pi_{C_n}(X)\|)\,dP \le \int \phi(\|X - Z_n\|)\,dP \to 0$. □

Proposition 5 Let $X \in L_2(\Omega, E)$, $X \sim P$. If $W_k(X) > 0$, then the strict
inequality $W_k(X) < W_{k-1}(X)$ holds.

Proof: In the present proof we do not assume the existence of k-centres.
Suppose, on the contrary, that $W_k(X) = W_{k-1}(X)$. Then for $\forall \varepsilon > 0$ there exists
a subset $B = B(\varepsilon) \in \mathcal{E}_{k-1}$ such that $W(B, X) < W_k(X) + \varepsilon$. Therefore, for
$\forall c \in E$, we have $W(B, X) - W(\{B, c\}, X) \le W(B, X) - W_k(X) < \varepsilon$. Since
$\|x - \Pi_{\{B,c\}}(x)\| < \|x - \Pi_B(x)\|$ iff $\|x - c\| < \|x - b\|$, $\forall b \in B$, we get that,
for every $c \in E$,

$$W(B, X) - W(\{B, c\}, X) = \int_{S(c)} [\phi(\|x - \Pi_B(x)\|) - \phi(\|x - c\|)]\,P(dx) < \varepsilon,$$

where $S(c) = \{x \in E : \|x - c\| < \|x - b\|, \forall b \in B\}$. Now, for any
$C = \{c_1, \dots, c_n\}$,

$$W(B, X) - W(B \cup C, X) \le \sum_{i=1}^{n} \int_{T(c_i)} [\phi(\|x - \Pi_B(x)\|) - \phi(\|x - c_i\|)]\,dP$$
$$\le \sum_{i=1}^{n} \int_{S(c_i)} [\phi(\|x - \Pi_B(x)\|) - \phi(\|x - c_i\|)]\,dP$$
$$< n\varepsilon, \qquad (8)$$

where $T(c_i) = \{x \mid \|x - c_i\| < \|x - c_j\|, \forall j = 1, \dots, n\} \cap S(c_i)$, i = 1, ..., n.
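Use Proposition 4 to choose a set $C = \{c_1, \dots, c_n\}$ such that $W_{k-1}(X) -
W(C, X) > W_{k-1}(X)/2$ and take $\varepsilon < W_{k-1}(X)/(2n)$. By assumption and (8), there
exists a set $B(\varepsilon) \in \mathcal{E}_{k-1}$ such that $W(B, X) - W(B \cup C, X) < n\varepsilon < W_{k-1}(X)/2$.
On the other hand, $W(B, X) - W(B \cup C, X) \ge W_{k-1}(X) - W(B \cup C, X) \ge
W_{k-1}(X) - W(C, X) > W_{k-1}(X)/2$ - a contradiction. □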

Proposition 6 Let $X, X_n \in L_2$, n = 1, 2, ..., be random variables satisfying
$\liminf_n W_k(X_n) \le W_k(X)$ and $W_k(X) > 0$. Then there exists a ball
$B := B(0, R)$ containing all $A_n \in \mathcal{U}_k(X_n)$, $\forall n$.

Proof: We first show that there exists a ball $B_1 := B(0, R_1)$ such that, for
each $A_n \in \mathcal{U}_k(X_n)$, $A_n \cap B_1 \ne \emptyset$, $\forall n$. If such a ball does not exist, we would
have (along some subsequence which we denote in the same way) $\|a_i^n\| \to \infty$
as $n \to \infty$, i = 1, ..., k. Thus, $\forall \omega \in \Omega$, $\|\Pi_{A_n}(X_n(\omega))\| \to \infty$ and, by the
convergence $X_n \to X$ a.s., $\phi(\|X_n - \Pi_{A_n}(X_n)\|) \ge \phi(|\,\|\Pi_{A_n}(X_n)\| - \|X_n\|\,|) \to
\infty$ a.s. Applying Fatou's Lemma, we therefore get

$$W_k(X) \ge \liminf_n \int \phi(\|X_n - \Pi_{A_n}(X_n)\|)\,dP \ge \int \liminf_n \phi(\|X_n - \Pi_{A_n}(X_n)\|)\,dP = \infty$$

- in contradiction with $X \in L_2$. We proceed using induction on k. Let $l < k$
and assume that there exists a ball $B_l := B(0, R_l)$ such that $\forall n$ $\exists A_n^* =
\{a_1^n, \dots, a_l^n\} \subset A_n \cap B_l$. Let us denote $I = \{1, \dots, l\}$ and $J = \{l + 1, \dots, k\}$.
We show that there exists a further ball $B_{l+1} := B(0, R_{l+1})$ such that, for
each n, $\exists a_1^n, \dots, a_{l+1}^n \in B_{l+1} \cap A_n$. Otherwise we would have $\|a_i^n\| \to \infty$, for
$\forall i \in J$. Since $A_n^* \subset B_l$, $\forall n$, are bounded subsets, it is possible to extract
a subsequence $\{A_{n'}^*\}$ : $a_i^{n'} \to a_i$, $\forall i \in I$. Let $A^* = \{a_1, \dots, a_m\}$, $m \le l$.
Clearly, for P-a.e. $\omega$, we have $\|X_{n'}(\omega) - a_i^{n'}\| \to \|X(\omega) - a_i\|$, for $i \in I$, and
$\|X_{n'}(\omega) - a_i^{n'}\| \to \infty$, for $i \in J$.

Therefore, by continuity of $\phi$,

$$\phi(\|X_{n'} - \Pi_{A_{n'}}(X_{n'})\|) \to \phi(\|X - \Pi_{A^*}(X)\|) \quad \text{a.s.,}$$

and by Fatou's Lemma

$$W_k(X) \ge \liminf_{n'} W_k(X_{n'}) = \liminf_{n'} \int \phi(\|X_{n'} - \Pi_{A_{n'}}(X_{n'})\|)\,dP$$
$$\ge \int \liminf_{n'} \phi(\|X_{n'} - \Pi_{A_{n'}}(X_{n'})\|)\,dP = \int \phi(\|X - \Pi_{A^*}(X)\|)\,dP$$
$$= W(A^*, X) \ge W_l(X).$$

We have got $W_k(X) \ge W_l(X)$ - a contradiction with Proposition 5. □
Corollary 7 Under the hypotheses of Proposition 6, there exists a ball
$B := B(0, R)$ containing all $A_n \in \mathcal{U}_k(X_n)$, n = 1, 2, ..., and all $A \in \mathcal{U}_k(X)$.
Proof: To cover $\mathcal{U}_k(X)$, take $X_n := X$ in the proof of Proposition 6. □

3.3 Consistency Results 

Theorem 8 Under the hypotheses of Proposition 6,

$$\sup_{A_n \in \mathcal{U}_k(X_n)} h(A_n, \mathcal{U}_k(X)) \to 0.$$

Proof: Choose an arbitrary sequence $\{A_n\}$ such that $A_n \in \mathcal{U}_k(X_n)$, $\forall n$.
It is sufficient to show that every subsequence $\{A_{n'}\}$ contains a convergent
sub-subsequence $A_{n''} \to A \in \mathcal{U}_k(X)$.

By Proposition 6 the sequence $\{A_{n'}\}$ is bounded. Using the relative
compactness of bounded sets one can extract a subsequence $\{A_{n''}\}$ such that
$a_i^{n''} \to a_i$, $\forall i$, i.e. $A_{n''} \to A = \{a_1, \dots, a_l\}$, $l \le k$. To show that $A \in \mathcal{U}_k(X)$
we proceed as in the proof of Proposition 6. From the convergence $X_{n''} \to X$
a.s. we have

$$\|X_{n''} - \Pi_{A_{n''}}(X_{n''})\| \to \|X - \Pi_A(X)\| \quad \text{a.s.,}$$

which gives, using (8), that $W_k(X) \ge W(A, X)$. Therefore, $A \in \mathcal{U}_k(X)$. □
In the previous result it was assumed that $W_k(X) > 0$. In the case of
$W_k(X) = 0$ (which can happen when r is relatively large), Proposition 6 is
obviously not valid any more. However, Theorem 8 still holds. To see this
we proceed as follows.

Let $A_n \in \mathcal{U}_k(X_n)$ be an arbitrary sequence. Let l be the maximal number
with the property that there exists a ball $B_l$ containing at least l elements of
each $A_n \in \mathcal{U}_k(X_n)$, $\forall n$ ($l \ge 1$, by reasoning as in the proof of Proposition
6). Let $A_n^* = \{a_1^n, \dots, a_l^n\} \subset A_n \cap B_l$. Then, along some subsequence, all
points from $A_n \setminus A_n^*$ tend to infinity. As the $A_n^*$ are bounded, there exists a
further subsequence $A_{n'}^* \to A^* \in \mathcal{E}_l$. Now, using Fatou's Lemma, it becomes
evident that

$$\liminf_n W_k(X_n) \ge \int \liminf_n \phi(\|X_n - \Pi_{A_n}(X_n)\|)\,dP = W(A^*, X) \ge W_l(X).$$

But the left-hand side is bounded from above by $W_k(X) = 0$. Therefore,
$W_l(X) = 0$ and $A^* \in \mathcal{U}_l(X)$. We conclude that $A_n^* \to \mathcal{U}_l(X)$, i.e. there
exists a sequence $B_n^* \in \mathcal{U}_l(X)$ such that $h(A_n^*, B_n^*) \to 0$. By $W_l(X) = 0$,
any k-point set which contains $B_n^*$ is a member of $\mathcal{U}_k(X)$. Thus $h(A_n, \{B_n^* \cup
(A_n \setminus A_n^*)\}) \to 0$, or $A_n \to \mathcal{U}_k(X)$. □

In terms of empirical k-centres we have obtained the following result:

Theorem 9 Let $\{P_n^\omega\}$ be a sequence of empirical measures corresponding
to P given on a finite-dimensional nls E. Assume that $\int \|x\|^2\,P(dx) < \infty$.
Then, for P-a.e. $\omega$, any sequence of empirical k-centres converges in the
Hausdorff metric to the set of theoretical k-centres, i.e.

$$P\Big\{\omega \;\Big|\; \sup_{A_n^\omega \in \mathcal{U}_k(P_n^\omega)} h(A_n^\omega, \mathcal{U}_k(P)) \to 0\Big\} = 1.$$

4 Conclusion 

Our main result, Theorem 9, has direct applications in classification theory.
It applies whenever a researcher is seeking an optimal cover of a
population by k circles while having only a sample from it (when analysing, for
example, empirical data in various fields, simulated data in Monte-Carlo
studies, etc.). According to Theorem 9, the researcher can be sure that any
empirically optimal set of k circles, found on the basis of a sample, is close to
a theoretically optimal set of k circles, provided that the sample size is large
enough. Due to the general form of the loss-function, where the radius r is a
free parameter, the convergence of empirically optimal k-centres also takes
place for the case of r = 0, the case which is known as 'k-means' clustering.

Acknowledgements 

The authors would like to thank referees for useful remarks. 

This research was supported by the Estonian Science Foundation (Grant No. 
1875). 

References 

BOCK, H.H. (1974): Automatische Klassifikation. Theoretische und praktische
Methoden zur Gruppierung und Strukturierung von Daten (Clusteranalyse).
Vandenhoeck & Ruprecht, Göttingen.

CUESTA, J. and MATRAN, C. (1988): The strong law of large numbers for
k-means and best possible nets of Banach-valued random variables. Probability
Theory and Related Fields, 78, 523-534.

FLURY, B.A. (1993): Principal points. Biometrika, 77, 33-42.

HASTIE, T. and STUETZLE, W. (1989): Principal curves. Journal of the American
Statistical Association, 84, 502-516.

LLOYD, S. P. (1982): Least squares quantization in PCM. IEEE Transactions on
Information Theory, 28, 129-136.

MACQUEEN, J. (1967): Some methods for classification and analysis of multivariate
observations. Proceedings of the Fifth Berkeley Symposium on Mathematical
Statistics and Probability, I, 281-297.

KIPPER, S. and PÄRNA, K. (1992): Optimal k-centres for a two-dimensional
normal distribution. Acta et Commentationes Universitatis Tartuensis, 942, 21-27.

POLLARD, D. (1981): Strong consistency of k-means clustering. Annals of
Statistics, 9, 135-140.

SPÄTH, H. (1975): Cluster-Analyse-Algorithmen zur Objektklassifizierung und
Datenreduktion. R. Oldenbourg Verlag, München - Wien.

SPÄTH, H. (1997): Orthonormal distance fitting by circles and ellipses with given
area. Journal of Computational Statistics, 12, 343-354.

VARADARAJAN, V.S. (1958): Weak convergence of measures on separable metric
spaces. Sankhya, 19, 15-22.



Computation of the Minimum Covariance 
Determinant Estimator 
Abstract: Robust estimation of location and scale in the presence of outliers is an
important task in classification. Outlier-sensitive estimation will lead to a large
number of misclassifications. Rousseeuw introduced two estimators with high
breakdown point, namely the minimum-volume-ellipsoid estimator (MVE) and the
minimum-covariance-determinant estimator (MCD). While the MCD estimator
has better theoretical properties than the MVE, the latter appears to be used
more widely. This may be due to the lack of fast algorithms for computing the
MCD, up to now.

In this paper two branch-and-bound algorithms for the exact computation of the
MCD are presented. The results of their application to simulated samples are 
compared with a new heuristic algorithm “multistart iterative trimming” and the 
steepest descent method suggested by Hawkins. The results show that multistart 
iterative trimming is a good and very fast heuristic for the MCD which can be 
applied to samples of large size. 



1 Introduction 

Consider the observations $x_1, \dots, x_n \in \mathbb{R}^d$ of the random variables $X_1, \dots, X_n$.
A typical distributional model might be that the $X_i$ are independent
identically distributed with an elliptically symmetric density. Given a subset
M of $\{x_1, \dots, x_n\}$ let mean(M) denote the sample mean and
$\mathrm{var}(M) := \frac{1}{|M|} \sum_{x_i \in M} (x_i - \mathrm{mean}(M))(x_i - \mathrm{mean}(M))^T$ denote the sample covariance matrix
of M.

Definition 1 (Rousseeuw (1983)) A k-Minimum-Volume-Ellipsoid estimator,
in short MVE(k)-estimator, $k \in \{1, \dots, n\}$, is a mapping $t: \mathbb{R}^{d \times n} \to \mathbb{R}^d$
with t(x) = center of a smallest ellipsoid covering at least k points of
$x = (x_1, \dots, x_n) \in \mathbb{R}^{d \times n}$.

A k-Minimum-Covariance-Determinant estimator, in short MCD(k)-estimator,
$k \in \{1, \dots, n\}$, is a mapping $t: \mathbb{R}^{d \times n} \to \mathbb{R}^d$ with t(x) = mean of k
points of x for which the determinant of the sample covariance matrix is
minimal.

The corresponding k points will be called an MCD(k) subset. Note that this
set is not necessarily unique.

To obtain a robust estimation of location, the parameter k should be a lower
bound for the number of non-outliers. The value $k := \lfloor (n + d + 1)/2 \rfloor$ gives
the highest possible breakdown point. See Rousseeuw and Leroy (1987).
MVE($\lceil n/2 \rceil$)                                      MCD($\lceil n/2 \rceil$)

permutation invariant                            permutation invariant
affine equivariant                               affine equivariant
breakdown point $\lfloor n/2 - d + 1 \rfloor / n$      breakdown point $\lfloor n/2 - d + 1 \rfloor / n$
$n^{1/3}$-consistent                                $n^{1/2}$-consistent
                                                 ML-estimator of a $\lfloor n/2 \rfloor$-outlier model
useful fast heuristics                           only slow heuristics
very popular

Figure 1: Comparison of MVE and MCD



Figure 1 compares the two estimators. The MCD estimator is the maximum
likelihood estimator of an $(n - k)$-outlier model where the non-outliers are
normally distributed. It has a higher order of convergence, see Davies (1992)
and Butler, et al. (1993). Therefore the MCD estimator has the better
theoretical properties. Nonetheless the MVE estimator appears to be more
widely used. This is due to the lack of fast algorithms for the MCD.

The main contributions of this paper are exact and heuristic algorithms for 
the fast computation of the MCD estimator. An extended version of this 
article including proofs can be found in Pesch (1998). 



2 Exact Computation of the MCD Estimator 

A naive approach for true global optimization of the MCD is the exhaustive 
enumeration of all subsets of size k and the computation of their covariance 
matrix determinants. This approach can be implemented as a depth first 
search in a tree like the one in figure 2, which shows the case n = 6 and
k = 3. Each leaf of this tree corresponds to a k-element subset of {1, ..., n},
hence the tree will be called the subset tree. At each leaf the covariance matrix
and its determinant have to be computed. The resulting algorithm will be 
denoted by MCD-NA. 
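The exhaustive enumeration can be written down in a few lines. The sketch below is an illustration in plain Python, not the paper's C++ implementation. It assumes the covariance is normalised by the subset size, and the helper names `det`, `covariance` and `mcd_naive` are hypothetical.

```python
from itertools import combinations

def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [list(row) for row in m]
    n, result = len(a), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda i: abs(a[i][c]))
        if abs(a[p][c]) < 1e-15:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            result = -result
        result *= a[c][c]
        for i in range(c + 1, n):
            f = a[i][c] / a[c][c]
            for j in range(c, n):
                a[i][j] -= f * a[c][j]
    return result

def covariance(points):
    """Sample covariance matrix, normalised by the number of points."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    return [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in points) / n
             for j in range(d)] for i in range(d)]

def mcd_naive(points, k):
    """Visit every k-element subset (every leaf of the subset tree)."""
    best = min(combinations(range(len(points)), k),
               key=lambda idx: det(covariance([points[i] for i in idx])))
    subset = [points[i] for i in best]
    d = len(points[0])
    mean = [sum(p[j] for p in subset) / k for j in range(d)]
    return mean, best

data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (10.0, 10.0)]
print(mcd_naive(data, 4))
```

The number of subsets grows as the binomial coefficient of n over k, so this is only usable for very small samples, which is the motivation for the bounding heuristics that follow.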



Figure 2: The subset tree (levels annotated with $d_i$, $T_i$, $S_i$, i = 1, 2, 3)



A speedup of the naive approach can be obtained using the following well
known update formulas. Let $T_i := i\,\mathrm{mean}(x_1, \dots, x_i)$, $S_i := i\,\mathrm{var}(x_1, \dots, x_i)$
and $d_i := \det(S_i)$. For each $i \in \{1, \dots, n - 1\}$ we have

$$T_{i+1} = T_i + x_{i+1},$$
$$S_{i+1} = S_i + \frac{1}{i(i+1)} (T_i - i\,x_{i+1})(T_i - i\,x_{i+1})^T .$$

Using these formulas we can compute $T_i$ and $S_i$ in the i-th level of the
subset tree in constant time given $T_{i-1}$ and $S_{i-1}$. This approach results in
a constant speedup of the naive algorithm. We denote this algorithm by
MCD-FA.
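The update of the running sum and scatter matrix can be checked numerically. The following sketch uses the standard rank-one update for a sum of points and the corresponding scatter matrix (i times the covariance), and compares the incremental result with a computation from scratch. The function names are illustrative.

```python
def direct_sum_and_scatter(points):
    """T = sum of the points, S = scatter matrix (i times the covariance)."""
    i, d = len(points), len(points[0])
    T = [sum(p[a] for p in points) for a in range(d)]
    mean = [t / i for t in T]
    S = [[sum((p[a] - mean[a]) * (p[b] - mean[b]) for p in points)
          for b in range(d)] for a in range(d)]
    return T, S

def update(T, S, i, x):
    """Given T and S built from i >= 1 points, add the point x."""
    d = len(x)
    v = [T[a] - i * x[a] for a in range(d)]
    T_next = [T[a] + x[a] for a in range(d)]
    S_next = [[S[a][b] + v[a] * v[b] / (i * (i + 1)) for b in range(d)]
              for a in range(d)]
    return T_next, S_next

points = [(1.0, 2.0), (3.0, 1.0), (0.0, 4.0), (5.0, 5.0)]
T, S = list(points[0]), [[0.0, 0.0], [0.0, 0.0]]   # one point: zero scatter
for i in range(1, len(points)):
    T, S = update(T, S, i, points[i])

T_ref, S_ref = direct_sum_and_scatter(points)
max_diff = max(abs(S[a][b] - S_ref[a][b]) for a in range(2) for b in range(2))
print(T, T_ref, max_diff)
```

Each update costs a fixed number of operations for a given dimension d, independent of the level i, which is what makes the per-node work in the subset tree constant.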

To deal with larger sample sizes efficiently, the number of subsets that need
to be computed must be reduced. Agullo (1994) uses a branch and bound
algorithm to reduce the number of subsets needed to compute the MVE
estimator. A similar bounding heuristic for the MCD estimator is the following

Lemma 1 Let $x_1, \dots, x_k \in \mathbb{R}^d$ and $d_i := \det(i\,\mathrm{var}(x_1, \dots, x_i))$. Then
$\forall i \in \{1, \dots, k - 1\} : d_{i+1} \ge d_i$.

In the i-th level of the subset tree the determinant $d_i$ is computed and
compared with the minimal determinant $d_k$ found so far. See figure 2. If $d_i$
is greater than or equal to this minimum, the search in the current branch of
the tree can be skipped. This branch-and-bound approach can be enhanced
by initialising the current minimum (bestDet) with a good approximation of
the minimal determinant instead of setting it to $\infty$ at the beginning. An
additional improvement can be achieved by sorting $x_1, \dots, x_n$ in decreasing
order according to their Mahalanobis distance wrt. the sample mean and
covariance. Points near the boundary of the sample will then be considered
first. Hence we can hope that more branches near the top of the tree can be
skipped after sorting the points. We call this algorithm MCD-BB.

A second, more sophisticated bounding heuristic that can be used in low 
dimensions d is given in the following lemma. 

Lemma 2 Let $x_1, \dots, x_n \in \mathbb{R}^d$ be lexicographically ordered and $1 \le k \le n$.
For all MCD(k) subsets $\{x_{i_1}, \dots, x_{i_k}\}$ with $\mathrm{var}(x_{i_1}, \dots, x_{i_k})$ positive definite,
$i_1 < \dots < i_k$, and all $1 \le j \le k$ we have

$$\text{open-convexhull}(x_{i_1}, \dots, x_{i_j}) \cap \{x_1, \dots, x_n\} \subset \{x_{i_1}, \dots, x_{i_j}\}. \qquad (1)$$

If the Xi are lexicographically ordered, the convex hull of the first j points of 
a MCD subset contains none of the other n — j points. Hence subtrees corre- 
sponding to subsets with a convex hull containing one or more of these n — j 
points need not be considered. Algorithms for the incremental construction 
of the convex hull can be found in Preparata and Shamos (1988). 

We call the resulting sweepline algorithm MCD-SW. It can be combined with
the first branch-and-bound algorithm in order to skip more subtrees. The
disadvantage of this combination, denoted by MCD-SWBB, is the necessity
to compute all the $d_i$, $i \in \{1, \dots, k\}$, and not just the $d_k$ in the cases where
the bounding heuristic (1) could not be applied.

A more detailed description and a programming language like representation 
of the algorithms can be found in Pesch (1998). 
3 Heuristic Algorithms

In real world applications an algorithm for sample sizes n of some thousands
and more is needed. Simulations show that the branch-and-bound algorithms
introduced in this paper cannot be used for this purpose in practice.
See section 4. Hawkins (1994) presented a steepest descent method to compute
an MCD subset. In this method, denoted by MCD-SD, the current
determinant can be decreased by exchanging an outlier against a non-outlier.
The pair with the largest decrease of the determinant is chosen. This
is repeated until no such pair can be found. The result is a local minimum.
Simulations show that the global minimum is reached with high probability.
Unfortunately the average time complexity is quadratic wrt. the sample size.
For larger sample sizes a faster heuristic is needed.

A widely used heuristic for robust estimation of location is the iterative 
trimming as introduced by Gnanadesikan and Kettenring (1972). 

Algorithm 1 (Iterative Trimming)

Input: $x_1, \dots, x_n \in \mathbb{R}^d$, $k \in \{d + 1, \dots, n\}$
repeat
    $\mu$ := mean($x_1, \dots, x_k$);
    $\Sigma$ := var($x_1, \dots, x_k$);
    rearrange $x_1, \dots, x_n$ such that $x_1, \dots, x_k$ have the k smallest
    Mahalanobis distances wrt. $\mu$, $\Sigma$;
until no rearrangement was necessary;
Output: mean($x_1, \dots, x_k$)

In the case of different points with the same Mahalanobis distance they are
sorted according to their indices. The following theorem shows that the
substitution of some points of a subsample by points with smaller Mahalanobis
distance decreases the determinant of the subsample's covariance. Therefore
the determinant of $\Sigma$ in the algorithm is strictly decreasing until termination
and the algorithm can be seen as a heuristic algorithm for the MCD
estimator.
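A compact Python sketch of iterative trimming is given below. It is restricted to planar data (d = 2) so that the covariance can be inverted by hand, assumes the covariance of every visited subset is positive definite, and adds an iteration cap that is not part of Algorithm 1. All names are illustrative.

```python
def mean_cov(points):
    """Mean and covariance (sxx, sxy, syy) of planar points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    return (mx, my), (sxx, sxy, syy)

def mahalanobis_sq(p, mu, cov):
    """Squared Mahalanobis distance using the explicit 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy        # assumed positive (positive definite)
    dx, dy = p[0] - mu[0], p[1] - mu[1]
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det

def iterative_trimming(points, k, max_iter=100):
    order = list(range(len(points)))   # the first k indices form the subset
    for _ in range(max_iter):
        mu, cov = mean_cov([points[i] for i in order[:k]])
        # Sort by distance; ties are broken by the original index.
        new_order = sorted(order,
                           key=lambda i: (mahalanobis_sq(points[i], mu, cov), i))
        if set(new_order[:k]) == set(order[:k]):
            break                      # no rearrangement was necessary
        order = new_order
    subset = sorted(order[:k])
    return mean_cov([points[i] for i in subset])[0], subset

# The outlier sits in the initial subset and is trimmed away.
data = [(10.0, 10.0), (0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
print(iterative_trimming(data, 4))
```

Because the result depends on the initial ordering, running the function on several random permutations of the data and keeping the subset with the smallest covariance determinant is a natural extension.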

Theorem 1 Let $\mu$ be the mean, V the positive definite covariance matrix
of the sample $x_1, \dots, x_k$ and $y_1, \dots, y_m$ some points, $m \le k$, such that
$U := \mathrm{var}(y_1, \dots, y_m, x_{m+1}, \dots, x_k)$ is positive definite and

$$\sum_{i=1}^{m} (x_i - \mu)^T V^{-1} (x_i - \mu) > \sum_{i=1}^{m} (y_i - \mu)^T V^{-1} (y_i - \mu).$$

Then det(V) > det(U).

Applied to different permutations of the $x_1, \dots, x_n$, iterative trimming will
in general compute different results. Therefore we can extend it by applying
iterative trimming to many different random permutations and taking the
minimum of the resulting covariance matrix determinants. We call this
method multistart iterative trimming (MCD-MIT). A combination of this
algorithm and Hawkins' steepest descent algorithm is possible by using the
result of iterative trimming as a starting point for steepest descent. This
combination is denoted by MCD-MITSD.

Unknown to the author, Rousseeuw and Van Driessen (1997) developed the
algorithm FAST-MCD, which is essentially the same as MCD-MIT. Their
paper also contains a version of the above theorem.



4 Experimental Results 

The algorithms have been implemented in C++ and compiled on a SUN
Ultra 1 Model 140. Figure 3 shows the average running time of the exact
algorithms applied to five different realizations of independent standard
normally distributed samples in two-dimensional space. The algorithms
computed the MCD($\lfloor n/2 \rfloor$) estimator. Note that the time axis is
logarithmically scaled. The linear curve of MCD-NA shows its exponential complexity.
MCD-FA is about 8 times faster. It needs more than 22 hours for a sample
of size 36. The branch-and-bound algorithms can reduce the exponential
complexity significantly. The best algorithm for this case is MCD-SWBB.
It needs less than 21 hours for samples of size 100. In the presence of outliers
the algorithms perform slightly better. Applied to the MCD(9n/10)
estimator they are able to process samples of size 260 in one day. In low
dimensions these algorithms can be used for small sample sizes.

Figure 3: CPU time of the MCD algorithms (panel: MCD estimator, 50%, no outliers)
To compare MCD-SD with MCD-MIT and MCD-MITSD each algorithm
was applied to five different realizations of two-dimensional standard normally
distributed samples of size 100. To simulate the presence of outliers, 10 of
these points are realizations of a centered normal distribution with covariance
36I. The algorithms tried to find the MCD(50) estimator and each
one was started from 10 000 different permutations of the 100 points. Table 1
shows the minimal determinant (computed by the exact algorithms),
the best value found by the heuristics, the relative error, the percentage of
starts that resulted in the best value (column #) and the average computation
time for one start in seconds. The practical performance of all three algorithms is
very good. MCD-MIT is the fastest of the algorithms but could not find the
global optimum in one case.





Algorithm    Sample   Exact      Heuristic   Error    #        Seconds

MCD-MIT      1        0.080246   0.080246    0%       1.6%     0.00390
             2        0.133001   0.133001    0%       6.3%     0.00512
             3        0.113353   0.113353    0%       7.7%     0.00460
             4        0.164044   0.164178    0.08%    12.4%    0.00516
             5        0.136785   0.136785    0%       0.2%     0.00578

MCD-MITSD    1        0.080246   0.080246    0%       71.1%    0.03843
             2        0.133001   0.133001    0%       100.0%   0.03273
             3        0.113353   0.113353    0%       26.2%    0.02683
             4        0.164044   0.164044    0%       0.8%     0.07012
             5        0.136785   0.136785    0%       96.8%    0.04623

MCD-SD       1        0.080246   0.080246    0%       93.3%    0.23158
             2        0.133001   0.133001    0%       100.0%   0.23692
             3        0.113353   0.113353    0%       41.1%    0.22653
             4        0.164044   0.164044    0%       39.6%    0.23272
             5        0.136785   0.136785    0%       76.8%    0.22832

Table 1: Comparison of Iterative Trimming and Steepest Descent (n = 100)



Table 2 shows the results for the MCD(180) estimator applied to samples
of size 200 with 20 outliers. All three algorithms become more stable with
k = 180 close to n = 200. Again MCD-MIT is the fastest of the algorithms
but could not find the global optimum in one case.

Algorithm   Sample   Exact      Heuristic   Error     # (% of starts)   Seconds
MCD-MIT     1        0.828053   0.828053    0%        100%              0.00493
            2        1.095320   1.095320    0%        100%              0.00800
            3        0.992113   0.992113    0%        100%              0.00795
            4        0.866272   0.866272    0%        100%              0.00959
            5        0.890182   0.890194    0.001%    100%              0.00788
MCD-MITSD   1        0.828053   0.828053    0%        100%              0.01821
            2        1.095320   1.095320    0%        100%              0.02141
            3        0.992113   0.992113    0%        100%              0.02131
            4        0.866272   0.866272    0%        100%              0.02324
            5        0.890182   0.890182    0%        100%              0.03459
MCD-SD      1        0.828053   0.828053    0%        100%              0.23967
            2        1.095320   1.095320    0%        100%              0.23982
            3        0.992113   0.992113    0%        100%              0.24024
            4        0.866272   0.866272    0%        100%              0.24281
            5        0.890182   0.890182    0%        100%              0.24653

Table 2: Comparison of Iterative Trimming and Steepest Descent (n = 200)

Table 3 compares the heuristics for a sample size of 4000 of which 400 points
are realized as outliers. The algorithms tried to compute the MCD(2000)
estimator. MCD-MIT was applied to 10 000, MCD-MITSD to 1 000 and MCD-SD
to 10 starts. All three algorithms found the same minimum. MCD-MIT
is about 30000 times faster than MCD-SD. Therefore it can be applied to
many more starts than the latter. This significantly increases the probability
of finding the global minimum. In this case MCD-MIT or MCD-MITSD
should be the method of choice.





                     MCD-MIT            MCD-MITSD          MCD-SD
Sample   Heuristic   #        Sec.      #        Sec.      #       Sec.
1        0.122268    12.07%   0.3488    19.8%    25.43     70%     13844.2
2        0.121194    12.85%   0.4275    24.9%    54.21     10%     13684.9
3        0.115212    5.75%    0.2791    28.7%    58.29     40%     14013.1
4        0.114514    3.21%    0.3034    5.2%     38.96     90%     13668.1
5        0.113181    27.44%   0.4517    100.0%   93.52     100%    13807.6

Table 3: Comparison of Iterative Trimming and Steepest Descent (n = 4000)



5 Conclusion 

The sweepline algorithm MCD-SW presented in this text provides an exact
method to compute the MCD estimator for small sample sizes. Sample sizes
of some hundreds can be handled by a steepest descent method MCD-SD
introduced by Hawkins (1994). For sample sizes of several thousands or more
steepest descent becomes inadequate because of its high computation time.
Multistart iterative trimming and its combination with steepest descent as
presented in this paper are able to close this gap and compute the MCD
estimator with high probability in far less time. With these algorithms the
MCD estimator becomes a usable alternative to the MVE estimator.

Acknowledgement: I thank the referees for their valuable comments,
especially for the important reference of Rousseeuw and Van Driessen (1997).

Abstract: Cerioli and Riani (1998) recently suggested a novel approach to the
exploratory analysis of spatially autocorrelated data, which is based on a forward 
search. In this paper we suggest a modification of the original technique, namely 
a block-wise forward search algorithm. Furthermore, we show the effectiveness of 
our approach in two examples which may be claimed to be ‘difficult’ to analyse 
in practice. In this respect we also show that our method can provide useful 
guidance to the identification of nonstationary trends over the observation area. 
Throughout, the emphasis is on exploratory methods and joint display of cogent 
graphical plots for the visualization of relevant spatial features of the data. 



1 Introduction 

With a random sample of n observations, say $z = [z_1, \ldots, z_n]'$, it is well
known that multiple outliers can lead to failure of single-case deletion
diagnostics due to masking and swamping effects (Barnett and Lewis, 1994).
The same is also true in the case of autocorrelated data, when z can be
thought of as a partial realization of a spatial process observed at n sites
$s_1, \ldots, s_n$. Specifically, a spatial outlier is defined as an observation which
is extreme with respect to its neighbouring values. Clusters of spatial
outliers may be attributed either to local contamination or to nonstationary
pockets. In addition, there may be problems due to trend surfaces over the
observation area.

Cerioli and Riani (1998) recently suggested a new approach to the ex- 
ploratory analysis of spatial data, which rests upon resistant estimation of 
spatial dependence parameters at a preliminary stage, and then on running 
a forward search through the data. Their method is applied to the kriging 
model of geostatistics, where the goal is optimal spatial prediction of
unknown process values (Cressie, 1993). In that context, $z_i$ is conceived as the
realization at site $s_i$ of the random field

$$Z(s) = \mu + \delta(s) + \varepsilon(s), \qquad s \in D, \tag{1}$$

where $\mu$ denotes a fixed but unknown constant, $\delta(s)$ is an intrinsically
stationary (Gaussian) random field with mean zero, $\varepsilon(s)$ denotes a zero-mean,
white-noise process, independent of $\delta(s)$, whose variance defines the
measurement error variability, and $D \subset \mathbb{R}^2$. Interest then lies in predicting
either the observed process $Z(s)$ or the smooth process $\delta(s)$. Single-case
diagnostics are thus given by standardized prediction residuals

$$\tilde e_i(z_{-i}) = \frac{z_i - \hat z_i(z_{-i})}{\sigma_i(z_{-i})}, \qquad i = 1, \ldots, n, \tag{2}$$

where $\hat z_i(z_{-i})$ denotes the predicted value of $z_i$ (or of its noiseless version
$\delta(s_i)$) based on the (n − 1) observations $z_{-i} = [z_1, \ldots, z_{i-1}, z_{i+1}, \ldots, z_n]'$, and
$\sigma_i^2(z_{-i})$ is the corresponding mean-squared prediction error (Christensen et
al., 1992).

The purpose of the present paper is twofold. First we suggest a modification 
of the original technique where (randomly) selected blocks of spatial units 
are used to initialize the forward search algorithm. Secondly, we show the 
power of our method in getting inside the spatial structure of the data in two 
examples where relevant features may be difficult to grasp through standard 
exploratory and model-based techniques for spatially autocorrelated data. 



2 Identification of Spatial Outliers 

A forward search algorithm similar to that of Atkinson (1994) and Atkin- 
son and Riani (1997) can be combined with resistant estimation of spatial 
dependence parameters in order to identify masked multiple outliers in spa- 
tial prediction models (Cerioli and Riani, 1998). The basic version of this 
algorithm rests upon the definition of an initial subset of $m \ll n$ spatial
locations which is intended to be outlier free. If n is moderate, this subset is 
chosen by exhaustive enumeration over all distinct m-tuples of sites, by 
minimizing a least median of squares criterion. Otherwise, random selection 
of the initial subset might be preferred for computational reasons. 

In this paper we consider an alternative approach where p blocks of contigu- 
ous spatial units are (randomly) selected as an initial subset for the forward 
search. Blocking methods for dependent data have been extensively used in 
recent years in many fields of spatial statistics (see e.g. Garcia-Soidan and 
Hall, 1997). The reason is that spatial processes typically exhibit short-range 
dependence and sampling of isolated sites does not preserve the original au- 
tocorrelation structure in the data. Put in the present context, we might 
expect that prediction based on well-behaved values at locations close in 
space can be more effective in detecting local anomalies near the selected 
blocks. 

For simplicity, we restrict to the case where locations $s_1, \ldots, s_n$ lie on a
regular grid with $n_1$ rows and $n_2$ columns, so that $n = n_1 \times n_2$. More general
spatial structures can be dealt with in a similar fashion, either by
approximation through a regular lattice or by definition of suitable neighbourhood
sets. Furthermore, we assume that exactly b sites belong to each block. We
also suppose that at a preliminary stage spatial dependence has been
estimated in a resistant way, using any of the methods suggested by Cressie
and Hawkins (1980). Our block-wise forward search algorithm can thus be
summarized as follows.

Step 1: Choice of the initial subset 

Let $S_j$, $j = [j_1, j_2]'$, be possibly overlapping blocks of b contiguous spatial
locations. In our applications, we define $S_j$ as the rectangle consisting of the
integer pairs $i = [i_1, i_2]'$ such that

$$(j_k - 1)\,l_k + 1 \le i_k \le (j_k - 1)\,l_k + b_k, \qquad k = 1, 2,$$

where $b_k$ and $l_k$ are positive integers defining block size and overlapping,
respectively. Note that $S_j$ is defined only for j such that $j_k = 1, \ldots, q_k$, where
$q_k = [(n_k - b_k)/l_k] + 1$, k = 1, 2, and [·] denotes the integer part. Therefore,
the total number of blocks available from $s_1, \ldots, s_n$ is $q = q_1 \times q_2$, while each
block has $b = b_1 \times b_2$ observations. To avoid waste of information, we also
assume that $l_k \le b_k$ and that $n_k$ is an integer multiple of $b_k$.

The initial subset of the forward search is selected by choosing $p \ge 1$ blocks
among the q available. At this stage either random or exhaustive selection
can be performed, according to which iteration scheme is adopted (see Step
3 below). Let m be the number of distinct spatial locations included in the
initial subset. If $l_k = b_k$ for k = 1, 2 (i.e. the blocks do not overlap), then
the initial subset consists of exactly $m = p \times b$ distinct spatial locations. On
the contrary, when $l_k < b_k$ and p > 1 some sites may appear more than once
in the initial subset, and its actual dimension may be $m < p \times b$. In practice
this is not likely to be a serious problem, however, since the initial stages 
of the forward search are typically of scant interest per se, due to potential 
instability of results computed from a small number of observations. 
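
The block construction of Step 1 can be sketched in a few lines. The sketch assumes that lexicographical indexing means the site in row $i_1$ and column $i_2$ gets index $(i_1 - 1) n_2 + i_2$ (row by row, starting at 1); that convention and all function names are assumptions made for illustration.

```python
def block_count(n_k, b_k, l_k):
    """Number of blocks along one axis: q_k = [(n_k - b_k) / l_k] + 1."""
    return (n_k - b_k) // l_k + 1

def block_sites(j1, j2, n2, b, l):
    """1-based lexicographic site indices of the block S_j with j = (j1, j2)."""
    rows = range((j1 - 1) * l[0] + 1, (j1 - 1) * l[0] + b[0] + 1)
    cols = range((j2 - 1) * l[1] + 1, (j2 - 1) * l[1] + b[1] + 1)
    return [(i1 - 1) * n2 + i2 for i1 in rows for i2 in cols]

def all_blocks(n1, n2, b, l):
    """Dictionary mapping each admissible j = (j1, j2) to its list of sites."""
    q1 = block_count(n1, b[0], l[0])
    q2 = block_count(n2, b[1], l[1])
    return {(j1, j2): block_sites(j1, j2, n2, b, l)
            for j1 in range(1, q1 + 1) for j2 in range(1, q2 + 1)}

def initial_subset(blocks, chosen):
    """Distinct sites covered by the chosen blocks (m can be below p * b)."""
    return sorted({site for j in chosen for site in blocks[j]})
```

On a 9 x 9 grid, `all_blocks(9, 9, (3, 3), (3, 3))` gives q = 9 non-overlapping blocks of 9 sites each, while `all_blocks(9, 9, (2, 2), (1, 1))` gives q = 64 overlapping blocks of 4 sites. Choosing two overlapping blocks shows how m can fall below p x b.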

Step 2: Running the forward search 

Given a subset of cardinality $r \ge m$, let $I_r$ denote the index set of
spatial locations belonging to this subset, and let $z_r$ be the corresponding
r-dimensional vector of observations. We move to dimension r + 1 by selecting
the r + 1 spatial locations with the smallest squared standardized prediction
residuals, whose form depends on whether $i \in I_r$ or $i \notin I_r$, where
$e_i(z_r) = z_i - \hat z_i(z_r)$ and $\sigma_i(z_r)$ is defined as in
equation (2). This move is repeated for $m \le r \le n - 1$, and spatial outliers
are detected by simple graphical displays of a variety of statistics involved
in the forward search. In our examples we monitor standardized prediction
residuals $\tilde e_i(z_r)$ for each value of r, and we draw plots of quantities like

$$\tilde e^{(r+1)} = \sqrt{\tilde e^2_{[r+1]}(z_r)}, \qquad \tilde e^{(\max)} = \sqrt{\tilde e^2_{[n]}(z_r)},$$

for r = m + 1, ..., n − 1, where $\tilde e^2_{[k]}(z_r)$ denotes the k-th ordered value among
the squared standardized prediction residuals.

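
The mechanics of Step 2 can be illustrated with a deliberately simplified sketch. It does not implement kriging prediction: as a stand-in, the residual of each unit is its squared deviation from the mean of the current subset, divided by the subset variance. The stand-in and all names are assumptions; a kriging-based residual function with the same signature could be passed in instead.

```python
def mean_based_residuals(values, subset):
    """Stand-in squared standardized residuals (NOT the kriging predictor)."""
    r = len(subset)
    mean = sum(values[i] for i in subset) / r
    var = sum((values[i] - mean) ** 2 for i in subset) / r
    return [(v - mean) ** 2 / var for v in values]

def forward_search(values, initial, residual_fn=mean_based_residuals):
    """Grow the subset one unit at a time; return the subset at every size."""
    n = len(values)
    subset = sorted(initial)
    path = [subset]
    while len(subset) < n:
        res = residual_fn(values, subset)
        # keep the r + 1 units with the smallest squared residuals
        order = sorted(range(n), key=lambda i: res[i])
        subset = sorted(order[:len(subset) + 1])
        path.append(subset)
    return path
```

Because the subset of size r + 1 is rebuilt from all n units, it need not contain the subset of size r. Units that enter only in the final steps are the candidates for outliers, which is why the last stages of the search are the ones to monitor.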
Step 3: Iteration of the algorithm 

The forward search algorithm can be repeated from different starting points.
If p = 1 and q is not too large, Step 1 of our procedure can be iterated by
exhaustive selection of all possible subsets of cardinality b. Otherwise, we
resort to random selection of q' (< q) initial subsets among those available.
Let Q be the total number of forward searches performed from the data (i.e.
Q = q or Q = q') and $m_s$ denote the actual dimension of the initial subset for
the s-th search (s = 1, ..., Q). Alternative searches are compared through
the quantities

$$\rho^{(s)} = \min_r \tilde e^{(\mathrm{med})}, \qquad s = 1, \ldots, Q, \tag{3}$$

where the minimum is taken over all stages r = m*, ..., n − 1 of the search,

$$\tilde e^{(\mathrm{med})} = \sqrt{\tilde e^2_{[\mathrm{med}]}(z_r)}, \qquad \mathrm{med} = r + [(n - r)/2],$$

and m* denotes the largest value among $\{m_s;\ s = 1, \ldots, Q\}$.

3 Applications 

The usefulness of our approach is revealed in two examples where standard 
single-case diagnostics fail. Indeed both data sets possess some spatial fea- 
tures which might raise doubts about the actual power of any diagnostic 
method. Even in such ‘difficult’ instances, nevertheless, we show the ef- 
fectiveness of the forward search in identifying spatial outliers and other 
relevant features of the data. 

3.1 Simulated Data With Small Contamination 

In the first example we simulated a Gaussian random field at the nodes of a
9 x 9 regular grid, whose sites are indexed in lexicographical order. Thus,
$n_1 = n_2 = 9$ and n = 81. The simulation was performed assuming a constant
mean over the grid and by modelling spatial dependence through a stationary
spherical semivariogram (Cressie, 1993) (data available upon request). Then
we modified by small amounts the simulated values at locations $s_{35}$, $s_{36}$
and $s_{45}$. This contamination does not seem to produce spuriously large
observations, possibly except for the value at site $s_{45}$ which slightly exceeds
the upper truncation point of the corresponding univariate box plot (not
shown). Also single-case deletion techniques based on (2) are unable to
identify any of the manipulated values.

The forward search algorithm is applied to the contaminated data set,
choosing p = 1 and $b_k = l_k = 3$ for k = 1, 2. Therefore, in this example, we have
Q = 9 different searches starting from alternative blocks of m = 9 distinct
spatial locations. Results for the best search according to criterion (3) are
shown in Figure 1, where we display the plots of $\tilde e^{(r+1)}$ and $\tilde e^{(\max)}$. For
simplicity we restrict our presentation to values r > 40, because the last steps
of the search typically carry most of the information about the underlying
structure of the data.

It is important to note that the three contaminated values are the last to be
included by the forward search. In addition, their outlyingness with respect
to the general autocorrelation structure in the data is clearly revealed by
both criteria, as the marked peak at r = 78 in the plot of $\tilde e^{(r+1)}$ and the
sharp ‘elbow’ in the curve of $\tilde e^{(\max)}$ show. These “change” points correspond
to the inclusion of the first outlier while the decrease at subsequent stages
is a consequence of the masking effect.

Figure 1: Simulated data with small contamination. Curves of (a) $\tilde e^{(r+1)}$ and
(b) $\tilde e^{(\max)}$.



3.2 Aerial Survey Data 

The spatial data analyzed in this section refer to reflectance values extracted 
from an aerial survey along the south coast of England (Haining, 1987). The 
purpose of the survey is to monitor pollution levels arising from the pumping 
of waste material into the English Channel. Higher reflectance values indi- 
cate higher levels of pollution. The spatial locations where such values have 
been collected form the nodes of a 9 x 9 regular grid (n = 81). These data are 
likely to contain a large-scale effect (trend surface), an autocorrelation
component and local-scale effects (nonstationary pockets). Of course, the closer
an observation is to the source of pollution, the greater its reflectance
value is likely to be. In this case the parameter values of the trend
component indicate the dispersal gradient. The presence of spatial correlation is
mainly due to the fact that the reflectance value recorded in one pixel is a 
partial averaging of reflectance values in neighbouring pixels. Local anoma- 
lies are also likely to be present, because the data are remotely sensed and are 
affected by small scale turbulence (e.g. wave action). The reflectance data 
are reported in Figure 2. As in our previous example, sites on the grid are 
indexed in lexicographical order. The corresponding three-dimensional plot 
is shown in Figure 3. This plot clearly highlights the complicated structure 
of the data. Haining (1987) concluded that this data set can be described 
by a “first order trend surface model with autoregressive errors”.
Subsequently, the same author (Haining, 1990, pp. 216-219) seemed to reach a
somewhat different conclusion, and suggested that the data contain “high
order trends” both in the horizontal and in the vertical direction and “show
evidence that outliers are present”. In this example we show how, using the
forward search algorithm, we can immediately detect nonstationary pockets
and spatial outliers, and throw light on the presence of the trend component.

Figure 2: Aerial survey data (Haining, 1990). Hatched sites are those
highlighted by the forward search.

In order to choose the initial subset for the forward search we consider all
possible blocks of size b = 4 with $l_1 = l_2 = 1$. The best initial subset,
according to criterion (3), is formed by locations $s_{30}$, $s_{31}$, $s_{39}$ and $s_{40}$. Figure 4
monitors standardized prediction residuals as the subset size (r) increases.
This plot clearly shows four dashed curves, corresponding to sites $s_{54}$, $s_{55}$, $s_{73}$
and $s_{74}$, for which $\tilde e_i(z_r) > 3$ in the central part of the forward search. Such
units, which are the last to be included by the search, form a cluster in
the bottom left corner of the grid and are surrounded by much smaller
values (see Figure 2). Figure 5, which shows the curves of $\tilde e^{(r+1)}$ and $\tilde e^{(\max)}$,
clearly confirms that observations at these four sites can be considered as
spatial outliers. Figure 4 also points out that from m = 64 onwards we
always include units whose standardized prediction residual is positive. This
group of observations has been represented with dotted lines in Figure 4. It
is important to identify the actual location of these sites on the grid for
several reasons: i) if they follow a systematic pattern they may denote the
presence of a trend component; ii) if they are contiguous they may identify
nonstationary pockets; iii) if they are dispersed through the region they may
represent local irregularities. In Figure 2 the positions corresponding to such
units have been hatched lightly and are shown to belong to three specific
zones of the grid.

Figure 4: Aerial survey data. Standardized prediction residuals in each step
of the forward search.

In conclusion, through our forward search algorithm we have easily
discovered a cluster of spatial outliers in the bottom left corner and a pocket of
nonstationary observations in the central left part of the grid. In addition,
two small local-scale irregularities seem to be present: the first in the
northern central part and the second in the central right area of the grid. If we go
back to the three-dimensional plot of the data we can see that the four areas
highlighted by the forward search correspond exactly to the peaks which
emerge from Figure 3. The new contribution of the forward search is that
such peaks are shown to possess somewhat different spatial features and are
thus to be treated in a different way.

Figure 5: Aerial survey data. Curves of (a) $\tilde e^{(r+1)}$ and (b) $\tilde e^{(\max)}$.

and Related Populations

Abstract: The present paper deals with likelihood classification rules for jointly
assigning $n_3$ individuals or one individual $n_3$ times measured from a population
$\Pi_3$ to one of two populations $\Pi_1$ and $\Pi_2$. Several cases of completely or only
partly known parameters of Gaussian or elliptically contoured overall sample
distributions are considered. Special emphasis will be on geometric representation
formulae for different classification rules. Based upon such formulae it is possible
to evaluate probabilities of correct classification explicitly as has been proved in
Krause and Richter (1994a,b).

1 Introduction 

Several classification rules are studied in mathematical and applied literature 
from various points of view. Basic results are included in numerous papers 
and monographs. Often both the likelihood and the Bayesian approaches are 
dealt with. Consideration in the present paper will be restricted to models 
with deterministic parameters. 

Standard reference books on classification include Lachenbruch
(1975), Ahrens and Lauter (1981), Anderson (1984), Lauter and Pincus
(1989), Lauter (1992), McLachlan (1992), Giri (1996) and Hand (1997).
It seems that the case of repeated measurements, i.e. the case where several 
individuals are to be allocated at the same time into the same population, 
has not yet been studied systematically in the literature. Motivating ap- 
plications in this direction can be found, e.g., in Schaafsma and Van Vark 
(1977), McLachlan (1992) and in Giri (1996) as well as in other work cited 
therein. Related models are dealt with in Lauter (1992). 

The aim of the present paper is to study the problem of jointly assigning $n_3$
individuals or one individual $n_3$ times measured from a population $\Pi_3$ to one
of two populations $\Pi_1$ and $\Pi_2$ using likelihood ratio criteria. Various cases
of known or unknown parameters as well as mixed situations are considered.
It will be assumed in Section 2 that the p-dimensional measurements from
$\Pi_i$ are distributed according to the Gaussian distributions with expectation
vectors $\mu_i$ and regular covariance matrices $\Sigma_i$, $i \in \{1, 2, 3\}$. Section 3 deals
with elliptically contoured measurements. If not all moments from the
populations $\Pi_1$ and $\Pi_2$ are known then training samples of sizes $n_1$ and $n_2$ will
be drawn from them, respectively.

It is well known from the literature that in various situations it is difficult
to evaluate explicitly probabilities of correct classification based upon the
respective likelihood ratio classification rule (LRCR). It has been shown,
however, in Krause and Richter (1994a,b and (in preparation)) that the
linear model approach to the classification problem leads to decision rules which
are of a geometrically well interpretable type. Based upon this approach it
is possible to evaluate probabilities of correct classification explicitly
using a geometric measure representation of the Gaussian or another spherical
distribution law. It will be shown in the present paper that these decision
rules can be reformulated in such a way that their coincidence with or their
difference to the respective LRCR’s becomes obvious. This is the general
motivation for giving geometric representation formulae for LRCR’s in this
paper. Note that optimality properties of the maximum likelihood
classification rule (MLCR), which is a special LRCR, have been studied in Das Gupta
(1965).

The first part of the present study of LRCR’s deals with the case that neither
the expectation vectors nor the covariance matrices from the populations $\Pi_1$,
$\Pi_2$ and $\Pi_3$ are known and $\Sigma_1 \ne \Sigma_2$, in general. Concerning the resulting
LRCR in Theorem 1 it should be emphasized that in this situation the
likelihood ratio is not just the density ratio with numerator and denominator
being simply the sample density from population $\Pi_3$ but with moments taken
once from $\Pi_1$ and the other time from $\Pi_2$ as it turns out to be in the case
that the moments are known in $\Pi_1$ and $\Pi_2$.

Afterwards it will be assumed additionally that $\Sigma_1 = \Sigma_2$. The LRCR for
this case has been dealt with by several authors in the above mentioned
books and further papers cited therein, but sometimes without explaining
the methodical background of this rule. It will be discussed here for some
completeness, as well as some further rules which are devoted to different
situations of more or less known parameters in $\Pi_1$ and $\Pi_2$.

If the parameters $\mu_i$ and $\Sigma_i$ of the population $\Pi_i$ are known for a certain i
from {1, 2, 3} then it is natural not to draw a sample from the respective
population. If, nevertheless, one assumes formally the presence of a sample
from such a completely known distribution then it turns out that the LRCR
suggests not to use this sample. One can start therefore the present study
based upon the following model.

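Denote by $\Phi_{\nu,\Lambda}$ the Gaussian distribution on Euclidean space with suitable
dimension and expectation vector $\nu$, covariance matrix $\Lambda$ and with density
$\varphi_{\nu,\Lambda}$. The covariance matrix $\Lambda$ is always supposed to be regular. Let

$$X_{ij} = (X_{1ij}, \ldots, X_{pij})^T, \qquad j = 1, \ldots, n_i,$$

be a sample of p-dimensional random measurements from population $\Pi_i$
which are independent and identically $\Phi_{\mu_i,\Sigma_i}$-distributed random vectors,
i = 1, 2, 3. Assume that $\mu_1$ and $\mu_2$ are different vectors from $\mathbb{R}^p$ and

$$(\mu_3, \Sigma_3) \in \{(\mu_1, \Sigma_1), (\mu_2, \Sigma_2)\},$$

i.e. $\Pi_3$ can be considered as a copy of $\Pi_1$ or $\Pi_2$, respectively. Put

$$X^{(i)} = (X_{i1}^T, \ldots, X_{in_i}^T)^T, \quad i = 1, 2, 3, \qquad X = (X^{(1)T}, X^{(2)T}, X^{(3)T})^T,$$

and let $X^{(1)}$, $X^{(2)}$ and $X^{(3)}$ be independent. The vectors $X^{(1)}$ and $X^{(2)}$ can
be interpreted as training samples of sizes $n_1$ and $n_2$ from the populations
$\Pi_1$ and $\Pi_2$, respectively, and $X^{(3)}$ is a vector of $n_3$ individual p-dimensional
measurements from population $\Pi_3$ which are jointly to be allocated into one
of the populations $\Pi_1$ and $\Pi_2$. The experimenter wants to decide between
the hypotheses

$$H_{1/3}: \mu_3 = \mu_1 \quad \text{and} \quad H_{2/3}: \mu_3 = \mu_2.$$

Note that $H_{1/3}$ and $H_{2/3}$ are single hypotheses iff both $\mu_1 = \mu_2$ and $\Sigma_1 = \Sigma_2$
are assumed. If the density of a random vector Z is denoted by $p_Z$ then

$$p_X(x) = \prod_{i=1}^{3} p_{X^{(i)}}(x^{(i)}), \qquad x = (x^{(1)T}, x^{(2)T}, x^{(3)T})^T \in \mathbb{R}^{np},$$

where $n = n_1 + n_2 + n_3$. With the notation

$$\mu_{(n_i)} = (\mu_i^T, \ldots, \mu_i^T)^T \in \mathbb{R}^{n_i p}$$

and

$$\Sigma_{(n_i)} = \mathrm{diag}(\Sigma_i, \ldots, \Sigma_i) = \mathrm{Cov}(X^{(i)}) \in \mathbb{R}^{n_i p} \times \mathbb{R}^{n_i p}$$

it follows

$$p_X(x) = \prod_{i=1}^{3} \varphi_{\mu_{(n_i)}, \Sigma_{(n_i)}}(x^{(i)}).$$

Under $H_{i/3}$ the overall density will be denoted by

$$p_{X \mid H_{i/3}} =: L_{i/3}(\mu_1, \mu_2, \Sigma_1, \Sigma_2) =: L_{i/3}, \qquad i \in \{1, 2\}.$$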

tii /3 



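Note that these hypotheses-restricted likelihood functions allow the
representations

$$L_{1/3} = \varphi_{\mu^{(1)}_{(n_1+n_3)}, \Sigma^{(1)}_{(n_1+n_3)}}(x^{(1/3)}) \cdot \varphi_{\mu_{(n_2)}, \Sigma_{(n_2)}}(x^{(2)})$$

if $H_{1/3}$ is true and

$$L_{2/3} = \varphi_{\mu_{(n_1)}, \Sigma_{(n_1)}}(x^{(1)}) \cdot \varphi_{\mu^{(2)}_{(n_2+n_3)}, \Sigma^{(2)}_{(n_2+n_3)}}(x^{(2/3)})$$

if $H_{2/3}$ is true. Here $x^{(i/3)}$ denotes the joint vector of $x^{(i)}$ and $x^{(3)}$, and the
superscript (i) indicates that $\mu_i$ and $\Sigma_i$ are repeated $n_i + n_3$ times. The
maximum likelihood estimators (mle’s) for the model
parameters based upon the hypotheses-restricted likelihood functions $L_{1/3}$
and $L_{2/3}$ will be called ‘hypotheses-restricted mle’s’ and denoted by

$$\operatorname*{mle}_{H_{i/3}}(\mu_l) \quad \text{and} \quad \operatorname*{mle}_{H_{i/3}}(\Sigma_l).$$

It turns out from well known multivariate analysis in Anderson (1984) that

$$\operatorname*{mle}_{H_{i/3}}(\mu_l) = \begin{cases} \bar X^{(i/3)} & \text{if } l \in \{i, 3\} \\ \bar X_{l\cdot} & \text{otherwise} \end{cases}$$

where $\bar X_{l\cdot} = \frac{1}{n_l} \sum_{j=1}^{n_l} X_{lj}$, $\bar X^{(i/3)} = \frac{n_i \bar X_{i\cdot} + n_3 \bar X_{3\cdot}}{n_i + n_3}$, and

$$\operatorname*{mle}_{H_{i/3}}(\Sigma_l) = \begin{cases} \frac{1}{n_i + n_3}\, \Gamma^{(i/3)}_{n_i+n_3} & \text{if } l \in \{i, 3\} \\ \frac{1}{n_l}\, \Gamma^{(l)}_{n_l} & \text{otherwise} \end{cases}$$

with

$$\Gamma^{(l)}_{n_l} = \sum_{j=1}^{n_l} (X_{lj} - \bar X_{l\cdot})(X_{lj} - \bar X_{l\cdot})^T,$$

$$\Gamma^{(i/3)}_{n_i+n_3} = \sum_{l \in \{i,3\}} \sum_{j=1}^{n_l} (X_{lj} - \bar X^{(i/3)})(X_{lj} - \bar X^{(i/3)})^T.$$

These formulae will serve as a unified starting point for deriving the different
LRCR’s below.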

2 Classification for Gaussian Distributions 

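In the case that the parameters $\mu_1, \mu_2, \Sigma_1, \Sigma_2$ of the Gaussian sample
distributions are unknown the likelihood ratio Q is defined as

$$Q = \frac{\sup_{\mu_1,\mu_2,\Sigma_1,\Sigma_2} L_{2/3}}{\sup_{\mu_1,\mu_2,\Sigma_1,\Sigma_2} L_{1/3}} = \frac{L^*_{2/3}}{L^*_{1/3}}$$

with

$$L^*_{1/3} = L_{1/3}\Big(\bar X^{(1/3)},\ \bar X_{2\cdot},\ \tfrac{1}{n_1+n_3}\Gamma^{(1/3)}_{n_1+n_3},\ \tfrac{1}{n_2}\Gamma^{(2)}_{n_2}\Big)$$

and

$$L^*_{2/3} = L_{2/3}\Big(\bar X_{1\cdot},\ \bar X^{(2/3)},\ \tfrac{1}{n_1}\Gamma^{(1)}_{n_1},\ \tfrac{1}{n_2+n_3}\Gamma^{(2/3)}_{n_2+n_3}\Big).$$

The LRCR rejects $H_{1/3}$ if Q > c for a suitably chosen threshold c. The
LRCR equals the MLCR if the critical value is c = 1. In what follows
|A| and tr[A] denote the determinant and the trace of a square matrix A,
respectively.

Theorem 1 The likelihood ratio Q allows the representation formula

$$\frac{Q}{C(n_1,n_2,n_3)} = \frac{|\Gamma^{(2)}_{n_2}|^{n_2/2}\ \big|\Gamma^{(1)}_{n_1} + \Gamma^{(3)}_{n_3} + \frac{n_1 n_3}{n_1+n_3}(\bar X_{1\cdot} - \bar X_{3\cdot})(\bar X_{1\cdot} - \bar X_{3\cdot})^T\big|^{(n_1+n_3)/2}}{|\Gamma^{(1)}_{n_1}|^{n_1/2}\ \big|\Gamma^{(2)}_{n_2} + \Gamma^{(3)}_{n_3} + \frac{n_2 n_3}{n_2+n_3}(\bar X_{2\cdot} - \bar X_{3\cdot})(\bar X_{2\cdot} - \bar X_{3\cdot})^T\big|^{(n_2+n_3)/2}},$$

where

$$C(n_1,n_2,n_3) = \frac{(1 + n_3/n_2)^{n_2 p/2}\,(n_2+n_3)^{n_3 p/2}}{(1 + n_3/n_1)^{n_1 p/2}\,(n_1+n_3)^{n_3 p/2}}.$$

Proof It is well known that the single and joint population sample densities
can be represented as

$$p_{X^{(i)}}(x^{(i)}) = (2\pi)^{-n_i p/2}\,|\Sigma_i|^{-n_i/2} \exp\Big\{-\tfrac{1}{2}\,\mathrm{tr}\big[\Sigma_i^{-1}\Gamma^{(i)}_{n_i}\big] - \tfrac{n_i}{2}\,\mathrm{tr}\big[\Sigma_i^{-1}(\bar X_{i\cdot} - \mu_i)(\bar X_{i\cdot} - \mu_i)^T\big]\Big\}$$

and

$$p_{(X^{(i)},X^{(3)})}(x^{(i/3)}) = (2\pi)^{-(n_i+n_3)p/2}\,|\Sigma_i|^{-(n_i+n_3)/2} \exp\Big\{-\tfrac{1}{2}\,\mathrm{tr}\big[\Sigma_i^{-1}\Gamma^{(i/3)}_{n_i+n_3}\big] - \tfrac{n_i+n_3}{2}\,\mathrm{tr}\big[\Sigma_i^{-1}(\bar X^{(i/3)} - \mu_i)(\bar X^{(i/3)} - \mu_i)^T\big]\Big\},$$

respectively. Due to the relation $\mathrm{tr}[n I_p] = np$ with the p x p unit matrix
$I_p$, it follows that

$$L^*_{1/3} = \frac{\exp\{-np/2\}\,(n_1+n_3)^{(n_1+n_3)p/2}\,n_2^{n_2 p/2}}{(2\pi)^{np/2}\,|\Gamma^{(1/3)}_{n_1+n_3}|^{(n_1+n_3)/2}\,|\Gamma^{(2)}_{n_2}|^{n_2/2}}$$

and

$$L^*_{2/3} = \frac{\exp\{-np/2\}\,(n_2+n_3)^{(n_2+n_3)p/2}\,n_1^{n_1 p/2}}{(2\pi)^{np/2}\,|\Gamma^{(2/3)}_{n_2+n_3}|^{(n_2+n_3)/2}\,|\Gamma^{(1)}_{n_1}|^{n_1/2}}.$$

As a consequence,

$$Q = C(n_1,n_2,n_3)\ \frac{|\Gamma^{(2)}_{n_2}|^{n_2/2}\,|\Gamma^{(1/3)}_{n_1+n_3}|^{(n_1+n_3)/2}}{|\Gamma^{(2/3)}_{n_2+n_3}|^{(n_2+n_3)/2}\,|\Gamma^{(1)}_{n_1}|^{n_1/2}}.$$

Now, the following lemma applies. □

Remark 2 Theorem 1 extends a result from Anderson (1984) for $n_3 = 1$ to
the case of arbitrary $n_3$.

Lemma 3 Covariance decomposition formula: For $i \in \{1, 2\}$:

$$\Gamma^{(i/3)}_{n_i+n_3} = \Gamma^{(i)}_{n_i} + \Gamma^{(3)}_{n_3} + \frac{n_i n_3}{n_i + n_3}\,(\bar X_{i\cdot} - \bar X_{3\cdot})(\bar X_{i\cdot} - \bar X_{3\cdot})^T.$$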
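
For p = 1 the determinants in the likelihood ratio reduce to sums of squares, so the covariance decomposition and the representation of Q can be checked numerically. The sketch below assumes the standard Gaussian maximum likelihood setup (pooled mean and pooled sum of squares under each hypothesis); the function names and the sample values are hypothetical.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def ss(xs):
    """Sum of squared deviations about the sample mean (Gamma for p = 1)."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs)

def pooled_ss(x_i, x_3):
    """Sum of squares of the joint sample about the pooled mean."""
    return ss(x_i + x_3)

def decomposition_rhs(x_i, x_3):
    """Right-hand side of the covariance decomposition for p = 1."""
    n_i, n_3 = len(x_i), len(x_3)
    d = mean(x_i) - mean(x_3)
    return ss(x_i) + ss(x_3) + n_i * n_3 / (n_i + n_3) * d * d

def max_loglik(xs):
    """Maximized Gaussian log-likelihood of one univariate sample."""
    m = len(xs)
    return -0.5 * m * (math.log(2 * math.pi) + math.log(ss(xs) / m) + 1.0)

def log_q_direct(x1, x2, x3):
    """log Q as the difference of the two maximized log-likelihoods."""
    return (max_loglik(x1) + max_loglik(x2 + x3)) \
        - (max_loglik(x1 + x3) + max_loglik(x2))

def log_q_formula(x1, x2, x3):
    """log Q from the representation Q = C * (ratio of sums of squares)."""
    n1, n2, n3 = len(x1), len(x2), len(x3)
    log_c = 0.5 * ((n2 + n3) * math.log(n2 + n3) + n1 * math.log(n1)
                   - (n1 + n3) * math.log(n1 + n3) - n2 * math.log(n2))
    log_ratio = 0.5 * (n2 * math.log(ss(x2))
                       + (n1 + n3) * math.log(pooled_ss(x1, x3))
                       - n1 * math.log(ss(x1))
                       - (n2 + n3) * math.log(pooled_ss(x2, x3)))
    return log_c + log_ratio
```

With c = 1 (the MLCR), a negative `log_q_formula(x1, x2, x3)` means that the first hypothesis is not rejected, i.e. the new measurements are allocated to the first population.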



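Remark 4

a) Based upon Corollary A.3.1 in Anderson (1984), p. 594, Q can be
written (for regular matrices $\Gamma^{(i)}_{n_i} + \Gamma^{(3)}_{n_3}$) as follows

$$\frac{Q}{C(n_1,n_2,n_3)} = \frac{|\Gamma^{(2)}_{n_2}|^{n_2/2}\,|\Gamma^{(1)}_{n_1} + \Gamma^{(3)}_{n_3}|^{(n_1+n_3)/2}}{|\Gamma^{(1)}_{n_1}|^{n_1/2}\,|\Gamma^{(2)}_{n_2} + \Gamma^{(3)}_{n_3}|^{(n_2+n_3)/2}} \times \frac{\Big(1 + \frac{n_1 n_3}{n_1+n_3}(\bar X_{1\cdot} - \bar X_{3\cdot})^T(\Gamma^{(1)}_{n_1} + \Gamma^{(3)}_{n_3})^{-1}(\bar X_{1\cdot} - \bar X_{3\cdot})\Big)^{(n_1+n_3)/2}}{\Big(1 + \frac{n_2 n_3}{n_2+n_3}(\bar X_{2\cdot} - \bar X_{3\cdot})^T(\Gamma^{(2)}_{n_2} + \Gamma^{(3)}_{n_3})^{-1}(\bar X_{2\cdot} - \bar X_{3\cdot})\Big)^{(n_2+n_3)/2}}.$$

If $n_3 = 1$ then $\Gamma^{(3)}_{n_3} = O \in \mathbb{R}^p \times \mathbb{R}^p$.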
b) Corollary A. 3.1 in Anderson (1984) yields another version of Q: 



rg> V" (i+vfrm- m) 



I \(ni+7i3)/2 



C{nu n2, ns) ^ ^2) 

+ ;f^(^ - ^)^(riV + F( 3 ))-^(Ai; - 
(1 + ^(^ - ^)^(Fl^) + Fi^3V'(^ - ^)) 

where yi , 2/2 are arbitrary vectors with yiyj = F),®) , for regular matrices 
r(i) F^^) 

It should be recalled that in Section 1 much motivation can be found for
geometric reformulations of LRCR’s. For giving a first formula of this kind
let p = 1, $1_m = (1, \ldots, 1)^T \in \mathbb{R}^m$,

$$1^{+00} = (1_{n_1}^T, 0_{n_2+n_3}^T)^T, \quad 1^{0+0} = (0_{n_1}^T, 1_{n_2}^T, 0_{n_3}^T)^T, \quad 1^{00+} = (0_{n_1+n_2}^T, 1_{n_3}^T)^T,$$

$$1^{+0+} = 1^{+00} + 1^{00+}, \qquad 1^{0++} = 1^{0+0} + 1^{00+}.$$

Consider the following subspaces of the sample space $\mathbb{R}^n$

$$\mathfrak{M}_{1/3} = \mathcal{L}(1^{+0+}, 1^{0+0}) \quad \text{and} \quad \mathfrak{M}_{2/3} = \mathcal{L}(1^{0++}, 1^{+00}),$$

where $\mathcal{L}(x, y)$ denotes the linear hull of the vectors x and y. Denote by
$\Pi_{\mathfrak{M}} x$ the orthogonal projection of x into the linear space $\mathfrak{M}$. Put

$$\|x\|_{(1/3)} = \|x^{(2)}\|^{n_2}\ \|(x^{(1)T}, x^{(3)T})^T\|^{n_1+n_3} \quad \text{and} \quad \|x\|_{(2/3)} = \|x^{(1)}\|^{n_1}\ \|(x^{(2)T}, x^{(3)T})^T\|^{n_2+n_3},$$

where $\|\cdot\|$ denotes Euclidean norm in the actual Euclidean space. Note that
the functionals $\|\cdot\|_{(i/3)}$ do not define norms.

Theorem 5 Geometric representation formula: If p = 1 then:

$$\frac{Q}{C(n_1,n_2,n_3)} = \frac{\|x - \Pi_{\mathfrak{M}_{1/3}} x\|_{(1/3)}}{\|x - \Pi_{\mathfrak{M}_{2/3}} x\|_{(2/3)}}.$$

For an interpretation of this result let x denote a concrete sample vector and

$$\mathfrak{D} = \mathcal{L}\big(x - \Pi_{\mathfrak{M}_{1/3}} x,\ x - \Pi_{\mathfrak{M}_{2/3}} x\big)$$

a two-dimensional linear subspace of the sample space $\mathbb{R}^n$ spanned by
the vectors standing within the brackets. The decision between $H_{1/3}$ and
$H_{2/3}$ will be made within this two-dimensional space $\mathfrak{D}$ which will therefore
be called the decision space.
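
The projections used in the geometric representation are easy to compute for p = 1: projecting a sample vector onto the span of block indicator vectors replaces every entry by the mean of its block. The following sketch illustrates this under that assumption; the function names, the index groups and the sample values are hypothetical.

```python
def project_on_indicators(x, groups):
    """Orthogonal projection of x onto the span of the indicator vectors of
    the disjoint index lists in `groups` (every entry becomes its block mean)."""
    proj = [0.0] * len(x)
    for idx in groups:
        m = sum(x[i] for i in idx) / len(idx)
        for i in idx:
            proj[i] = m
    return proj

def residual(x, groups):
    """Component of x orthogonal to the span of the indicator vectors."""
    proj = project_on_indicators(x, groups)
    return [xi - pi for xi, pi in zip(x, proj)]

def squared_norm(v, idx=None):
    """Squared Euclidean norm of v, optionally restricted to the indices idx."""
    idx = range(len(v)) if idx is None else idx
    return sum(v[i] ** 2 for i in idx)

# Samples 1, 2, 3 occupy positions 0-2, 3-4 and 5-6 of the sample vector.
x = [1.0, 2.0, 4.0, 10.0, 12.0, 3.0, 5.0]
groups_13 = [[0, 1, 2, 5, 6], [3, 4]]  # samples 1 and 3 pooled, sample 2 alone
res_13 = residual(x, groups_13)
```

The squared norm of `res_13` restricted to the pooled positions is the pooled sum of squares of samples 1 and 3, and restricted to positions 3-4 it is the sum of squares of sample 2. These are the ingredients that enter the likelihood ratio.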

If one assumes in the situation of Theorem 5 additionally that Si = S 2 one 
gets the following geometric representation formula 






X — 
X — 



which is a starting point in Krause and Richter (1994a,b and in preparation) 
for deriving exact formulae for probabilities of correct classification. Note 
that the decision will be made again in the same decision space as before 
but based now upon a likelihood ratio having a much simpler structure. 
The analytical representation formula for $p > 1$ and unknown $\mu_1, \mu_2, \Sigma_1 = \Sigma_2$
also has a simpler structure than the result from Theorem 1:



g2/n 



X + (Xi. - X3.) (Xu - X3.) 



where A = ZUi is a.s. regular if ni + ri 2 + na > p . 

A similar effect occurs for known $\mu_1, \mu_2, \Sigma_1, \Sigma_2$. Actually, the LRCR
performs just like the well known quadratic classification rule if $\Sigma_1 \neq \Sigma_2$ and
like the simpler linear classification rule if $\Sigma_1 = \Sigma_2$.

A further interesting effect appears in the case of unknown $\mu_1, \mu_2$ but known
$\Sigma_1 = \Sigma_2 = \Sigma$, say. In this case it holds



^ • IB, = (rr - 

TI2 i" /^s ^ 

where $W$ is the rule in Dorflo (1993) who dealt with the case $n_3 = 1$. The
influence of the imbalancedness term



IB — exp{ 



rii - ri2 

2(1 + ni/n 3 )(l + 712/713) 



(Xl.-X3.rS-'(Xi,-X3.)} 



vanishes only under balancedness, i.e. if $n_1 = n_2$. Note that under $n_1 = n_2$
the LRCR for unknown $\mu_1, \mu_2$ but known $\Sigma_1 = \Sigma_2$ performs just like the
LRCR for known $\mu_1, \mu_2$ and $\Sigma_1 = \Sigma_2$ if in the rule for the latter case $\mu_1$ and
$\mu_2$ are replaced by the maximum likelihood estimators $X_{1\cdot}$ and $X_{2\cdot}$, respectively.
Recognize, however, that plug-in estimators derived from LRCR’s in
cases of partly or completely known parameters are not, in general, LRCR’s for cases of
completely unknown parameters.

Finally notice that the LRCR for unknown $\mu_1, \mu_2$ but known $\Sigma_1, \Sigma_2$ and
$n_3 = 1$ is

Q = {\L,m2\f'\xp{-K*{X,)}, 

where $K^*$ modifies the quadratic decision rule $K$ in Dorflo (1993) with respect
to the additional factors $n_i/(n_i + 1)$:



ri2 


{x - X 2 ,] 


(x 


-^ 2 .; 


77,2 + 1 


ni 


{x-X;', 






77-1 + 1 



3 Elliptically Contoured Distributions 

In Section 2 classification was considered for repeated measurements in
Gaussian populations. The present section deals with LRCR’s in elliptically
contoured populations. Put $N = np$ and assume that the $R^N$-valued
random measurement vector $X$ follows the elliptically contoured distribution
law with parameters $\mu \in R^N$, $\Sigma$, a positive definite form matrix, and
density generating function $g: [0, \infty) \to [0, \infty)$ satisfying

$0 < \int_0^\infty r^{N-1} g(r^2)\, dr < \infty.$

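The density of $X$ is then

$f_X(x) = C(N, g)\, |\Sigma|^{-1/2}\, g\big((x - \mu)^T \Sigma^{-1} (x - \mu)\big), \quad x \in R^N,$

$C(N, g) = \Gamma(N/2) \Big/ \Big( 2\pi^{N/2} \int_0^\infty r^{N-1} g(r^2)\, dr \Big).$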

Let 

q{r) = C{N,g)g{r), r > 0, 

denote a normalized version of the density generating function and assume 
that r^/‘^q[r) has a uniquely defined finite maximum point, say Uq{N). Put 

((x - gfE-\x - yu)) , 

then 

= Vn,T,-,g{x),X G 

and it follows from Giri (1996) or Mathai et al. (1995) that 

cov{X) = E{X- EX) {X - EXf = 

with 

Pn = J^N tr[z 2 :^] q{z^z) dz. 

Recognize that it is assumed here that the random sample vectors $X_{ij}$,
$j = 1, \ldots, n_i$, $i = 1, 2, 3$, are uncorrelated but, different from the Gaussian case,
they are in general dependent on each other. Nevertheless it is possible
to derive maximum likelihood estimators for the first and second order
moments quite similar to the Gaussian case. This remains true if mle’s are
considered under the restriction $H_{1/3}$ or $H_{2/3}$. The general way of how to
deduce the mle’s of the expectation vectors and covariance matrices in the case
of elliptically contoured sample distributions from the corresponding mle’s
in the Gaussian case has been described, e.g., in Anderson et al. (1990),
in Giri (1996), 5.3.8, and in Mathai et al. (1995). Due to the respective
calculation it follows that the mle’s for the expectations are precisely the
same as in the Gaussian case and

Here, mleg(.) denotes a maximum likelihood estimator if the overall sample 
vector X is governed by the elliptically contoured density 
The unrestricted likelihood function in the present statistical structure is

/ 3 n/ \ 

EE(^0- - {Xi, - w) • 

\(=U=1 J 

The hypotheses restricted likelihood functions are 

Li/3 (Pi,/X2,Si,E2) = 

ni U2 

?(tr[Er^ E E(%“/“i)(^d“Mir] + trp2“^E(^2j'-/^2)(a:2j-/^2)^]) 

;e{i,3}t=i j=i 



L 2/3 (pi,M 2 ,Si,E 2 ) = |Eir”'/ 2 |E 2 r(”^+”^)/ 2 - 

ni ni 

g(tr[Er^ E(^ii “ + trp2“^ E E(^(J - M2)(a;o- - ^ 2 )^])- 

j=i /e{2,3}i==i 

On combining these equations and the above representation formulae for the 
mle’s for the first and second order moments it follows 

■^1/3 = niax Li/ 3 (/ii,/r2,Si,E2)| 

' /Zl,//2,Sl,S2 1^1/3 

1 nn\ -("i+"3)/2 I -nift 

= 'imm |-rsj 



as well as 



^ 2/3 = L2/3(/ii,/r2,Si,E2) 

' /X1,/X2,Si,S 2 m2/3 

1 -ni/2 1 

= q{jJ,{N)) -r« — ^ — r 

IV 9 V ni n2 + U3 



-(n2+n3)/2 



Hence, for unknown //i, // 2 , Si, S 2 and overall density the likelihood 

ratio Qg allows the representation 

|p( 2)|"2/2| (1/3) (ni+n3)/2 

^ ^2/3 \ I ^2 I 1^ ni+ns 

~ ~ |p(2/3) |(^2+n3)/2 (i)ini/2- 

|^n2+n3| 

From the proof of Theorem 1 it becomes obvious that the following theorem 
has thus been proved. 

Theorem 6 If $Q$ denotes the likelihood ratio from Section 2 then

$Q_g = Q.$
Abstract: We describe a computer intensive method for linear dimension
reduction which minimizes the classification error directly. Simulated annealing
(Bohachevsky et al. (1986)) is used to solve this problem. The classification error 
is determined by an exact integration. We avoid distance or scatter measures 
which are only surrogates to circumvent the classification error. Simulations (in 
two dimensions) and analytical approximations demonstrate the superiority of 
optimal classification over the classical procedures. We compare our procedure to 
the well-known canonical discriminant analysis (homoscedastic case) as described 
in McLachlan (1992) and to a method by Young et al. (1987) for the heteroscedas- 
tic case. Special emphasis is put on the case when the distance based methods 
collapse. The computer intensive algorithm always achieves minimal classification 
error. 



1 Introduction 

Classification deals with the allocation of objects to g predetermined groups 
G = {1,2, .. . ,g}, say. The goal is to minimize the misclassification rate 
over all possible future allocations, characterized by the conditional densities 
$p_i(x)$, $i = 1, 2, \ldots, g$; $x \in R^d$. The minimal error is the so-called Bayes
error (McLachlan (1992)). Often we want to reduce the dimension of the 
classification problem to one or two dimensions in order to support human 
imagination without significantly increasing the misclassification rate. This 
article deals with linear combinations of the original variables to achieve this 
goal: Linear Dimension Reduction. The next section reviews the classical 
approach based on distance measures and presents the idea of Young et 
al. (1987) in a way that facilitates such a distance formulation. Section 3
introduces computer-intensive dimension reduction and simulated annealing.
Section 4 compares the classical and the computer-intensive method.



2 Classical Linear Dimension Reduction 

The intuitive idea is to project the data in a way that maximizes the distance
between the groups (hopefully this will also minimize the misclassification
rate). The distance measure relates the between-group scatter matrix

$S_B = \sum_{i=1}^{g} p(i) (\mu_i - \bar{\mu})(\mu_i - \bar{\mu})'; \quad \bar{\mu} = \sum_{i=1}^{g} p(i) \mu_i$ (1)

to the pooled within-group scatter matrix

$S_W = \sum_{i=1}^{g} p(i) \Sigma_i$ (2)

where $p(i)$, $i = 1, 2, \ldots, g$, denote the a priori probabilities of the different
groups, $\mu_i$ their means and $\Sigma_i$ their covariance matrices. The maximal rank
of $S_B$ is $g - 1$. In order to project on the "best" direction $a \in R^d$ we
maximize the quotient

$\frac{a' S_B a}{a' S_W a}$ (3)

by variation of $a$. The maximum is attained at the eigenvector $v_1$ corresponding
to the largest eigenvalue $\lambda_1$ of $S_W^{-1} S_B$.
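If $p(i) = 1/g$, formula (3) is equivalent to

$\lambda_1 = \max_a \frac{a' S_B a}{a' S_W a} = \max_a \frac{1}{g} \sum_{i=1}^{g} (\tilde{\mu}_i - \tilde{\mu})^2$ (4)

where

$\tilde{\mu}_i = \frac{a' \mu_i}{(a' S_W a)^{1/2}}$ and $\tilde{\mu} = \frac{a' \bar{\mu}}{(a' S_W a)^{1/2}}$. (5)

This expression is easier to analyze.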

An idea of Young et al. (1987) incorporates different covariance matrices for
different groups. First we build the matrix

$M = [\mu_2 - \mu_1 | \cdots | \mu_g - \mu_1 | \Sigma_2 - \Sigma_1 | \cdots | \Sigma_g - \Sigma_1]$, (6)

where $M \in \mathcal{M}(d, s)$, $s = (g - 1)(d + 1)$, by juxtaposition of the vectors and
matrices. We assume $\Sigma_i \neq \Sigma_1$ for at least one $i \in G$.

The analogue to (4) is

$\lambda_2 := a' M M' a = \underbrace{\sum_{i=2}^{g} (a'(\mu_i - \mu_1))^2}_{\text{mean portion}} + \underbrace{\sum_{i=2}^{g} \sum_{j=1}^{d} (a'(\Sigma_i^j - \Sigma_1^j))^2}_{\text{covariance portion}}$, (7)

where $\Sigma_i^j$ denotes the $j$th column vector of the $i$th covariance matrix. The
term is divided into a pure mean portion and a pure covariance portion.
Further directions can be calculated by means of the eigenvectors of $S_W^{-1} S_B$
and $M M'$, respectively.
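The classical direction of formula (3) can be sketched numerically as follows. This is only an illustration under stated assumptions: the function name `fisher_direction`, the example means, and the identity covariance matrices are hypothetical choices, not taken from the paper.

```python
import numpy as np

def fisher_direction(means, covs, priors):
    """Direction a maximizing a'S_B a / a'S_W a, and the maximal value."""
    means = np.asarray(means, dtype=float)
    priors = np.asarray(priors, dtype=float)
    overall = priors @ means                          # prior-weighted overall mean
    centered = means - overall
    S_B = (priors[:, None] * centered).T @ centered   # between-group scatter (1)
    S_W = sum(p * np.asarray(c, dtype=float)          # pooled within-group scatter (2)
              for p, c in zip(priors, covs))
    # Eigenvectors of S_W^{-1} S_B; this matrix is not symmetric in general.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    k = np.argmax(eigvals.real)
    a = eigvecs[:, k].real
    return a / np.linalg.norm(a), eigvals[k].real

# Hypothetical example: three groups in two dimensions, equal priors.
means = [[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]]
covs = [np.eye(2)] * 3
a, lam = fisher_direction(means, covs, [1 / 3, 1 / 3, 1 / 3])
```

In this example the third mean lies far from the other two along the second coordinate, so the leading eigenvector points along the y-axis.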
3 Computer-Intensive Dimension Reduction
and Optimization

3.1 Computer-Intensive Dimension Reduction

This section applies simulated annealing to linear dimension reduction. The
algorithm optimizes the entries in the matrix $R \in \mathcal{M}(e, d)$ which is used
for the projection onto the lower dimensional space. More specifically, we
minimize the misclassification rate $f(r)$:

Minimize $f: \mathcal{M}(e, d) \to R^+$, $f(r) = \sum_{i=1}^{g} p(i) \int_{B_i} p_{R,i}(x)\, dx$, (8)

where $r$ is the vector of entries in $R$ and $B_i = \{x \mid \exists j : p_{R,i}(x) < p_{R,j}(x)\}$,
$p_{R,i}$ equals the $i$-th projected group density and $e$ and $d$ denote the dimension
of the lower dimensional space and the original one, respectively. The
integration is performed by numerical quadrature procedures.
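For a projection onto a single direction ($e = 1$) the objective (8) might be evaluated as in the sketch below. It assumes Gaussian group densities and allocation of each projected point to the group with the largest prior-weighted projected density; the trapezoidal rule on a fixed grid stands in for whatever quadrature routine the authors used, and the name `projected_error` is hypothetical.

```python
import numpy as np

def projected_error(a, means, covs, priors, n_grid=20001, width=10.0):
    """Misclassification rate after projecting Gaussian groups onto direction a."""
    a = np.asarray(a, dtype=float)
    m = np.array([a @ np.asarray(mu, dtype=float) for mu in means])  # projected means
    s = np.sqrt([a @ np.asarray(c, dtype=float) @ a for c in covs])  # projected std devs
    x = np.linspace((m - width * s).min(), (m + width * s).max(), n_grid)
    # Prior-weighted projected densities p(i) * p_{R,i}(x), one row per group.
    dens = np.array([p * np.exp(-0.5 * ((x - mi) / si) ** 2) / (si * np.sqrt(2 * np.pi))
                     for p, mi, si in zip(priors, m, s)])
    correct = dens.max(axis=0)                 # mass classified correctly at each x
    integral = np.sum((correct[1:] + correct[:-1]) * np.diff(x)) / 2.0  # trapezoid rule
    return 1.0 - integral

# Two unit-variance groups two units apart along the x-axis, equal priors.
means = [[0.0, 0.0], [2.0, 0.0]]
covs = [np.eye(2), np.eye(2)]
err_x = projected_error([1.0, 0.0], means, covs, [0.5, 0.5])
err_y = projected_error([0.0, 1.0], means, covs, [0.5, 0.5])
```

Projecting onto the x-axis keeps the two groups separated, whereas projecting onto the y-axis makes them coincide, so the second error rate is that of pure guessing.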

3.2 Simulated Annealing 

We use an implementation of the simulated annealing algorithm based on a
routine in Numerical Recipes in C (Press et al. (1992)). The basis of this
routine is the search algorithm of Nelder and Mead, see also Press
et al. (1992). This algorithm encloses the optimum by shrinking simplices.
The shrinking is proportional to a sequence $T_n \to 0$, $n \to \infty$. Therefore,
as $T_n$ approaches zero, only more and more local movements are allowed and
the algorithm converges to the nearest optimum.

For a fixed $T_n$ the algorithm generates a Markov chain with transition
function $g(r_i, r_p)$ corresponding to a stochastic version of the moves of the
Nelder and Mead algorithm.

The trial point is accepted with probability $\pi(r_p)/\pi(r_i) = \exp(-(f(r_p) - f(r_i))/T_n)$.
In this way matrices $R$ leading to a decrease of misclassification
are accepted in any case, but matrices increasing the error rate are also
accepted with some probability. This is why simulated annealing
is able to overcome local optima and thus avoids the selection of multiple
starting points. This makes simulated annealing useful even in lower dimensions.

After a number of steps in the Markov chain, the temperature is decreased,
$T_{n+1} = \alpha T_n$ ($0 < \alpha < 1$), and a new chain is created.

The parameter ranges we have chosen are: temperature $T_0$ = 5.0% to 50%,
$\alpha$ = 0.8 to 0.98 and number of iterations in the Markov chain at each
temperature: 50 to 500.
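The acceptance rule and the geometric cooling schedule can be illustrated with a minimal sketch. Note the swapped-in technique: trial points are generated here by Gaussian perturbations rather than by the stochastic Nelder and Mead moves of the Numerical Recipes routine, and all default parameter values as well as the function name `anneal` are illustrative assumptions.

```python
import numpy as np

def anneal(f, r0, t0=0.5, alpha=0.9, chain_len=100, n_temps=60, step=0.5, seed=0):
    """Minimize f by simulated annealing with Metropolis acceptance."""
    rng = np.random.default_rng(seed)
    r = np.asarray(r0, dtype=float)
    fr = f(r)
    best, fbest = r.copy(), fr
    t = t0
    for _ in range(n_temps):
        for _ in range(chain_len):                 # Markov chain at fixed temperature
            cand = r + step * rng.standard_normal(r.shape)
            fc = f(cand)
            # Improvements are always accepted; deteriorations are accepted
            # with probability exp(-(f(cand) - f(r)) / t).
            if fc <= fr or rng.random() < np.exp(-(fc - fr) / t):
                r, fr = cand, fc
                if fr < fbest:
                    best, fbest = r.copy(), fr
        t *= alpha                                 # cooling: T_{n+1} = alpha * T_n
    return best, fbest

# Toy objective standing in for the misclassification rate f(r).
best, fbest = anneal(lambda r: float(np.sum((r - 1.0) ** 2)), [4.0, -3.0])
```

Keeping track of the best point visited so far is a convenience added in this sketch; it is not described in the text.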

This computer-intensive method achieves minimal misclassification error if
adequately implemented.
4 Comparison with the Classical Approach

The optimization algorithm introduced in Section 3 is now compared to
the classical approach. The classical procedures do not provide a direct
link to the misclassification rate (that is, from a small perturbation of the
direction $a$, one cannot analytically derive the corresponding variation in the
misclassification rate). In fact, in some special cases (depending on the grouping
and the form of the covariance matrices), a significant difference between the two
procedures can be detected. Apart from the pure comparison, emphasis is
put on the question of when distance based "analytical" methods collapse. In
these cases only the algorithm in Section 3 supplies valid results.

4.1 Equal Covariance Matrices and g = 3 Groups

Let $\Sigma_i = \Sigma$ for all $i = 1, \ldots, g$. In formula (4), assume that there is one
group $i$ with

$\|\mu_i - \bar{\mu}\| \gg \|\mu_j - \bar{\mu}\| \quad \forall j \neq i$. (9)

Then the sum has one dominant term which is maximized at the cost of the
other summands, because the distances in (4) are squared. Therefore we get
the approximation

$\lambda_1 = \max_a \frac{a' S_B a}{a' S_W a} \approx \max_a \frac{1}{g} \frac{(a'(\mu_i - \bar{\mu}))^2}{a' \Sigma a}$. (10)

Maximization under the constraint $\|a\| = 1$ yields the value

$\frac{1}{g} (\mu_i - \bar{\mu})' \Sigma^{-1} (\mu_i - \bar{\mu})$ attained at $a \propto \Sigma^{-1} (\mu_i - \bar{\mu})$. (11)


Hence, we project on a direction that is dominated by $\mu_i$. The other
means are only incorporated by $\bar{\mu}$. This behaviour leads to suboptimality.
To get a better understanding, we conduct some simulations. First, we
transform the common covariance matrix $\Sigma$ by the transformation
$x_{new} = \Sigma^{-1/2} x$ to the identity matrix $I_d$. This does not increase the misclassification
rate. Because of the symmetry induced by three groups, it suffices to take
$d = 2$. Therefore we set

$\mu_1 = (0, 0)'$, $\mu_2 = (2, 0)'$ and $\mu_3 = (x, y)'$. (12)

Mean $\mu_1$ only determines the origin and $\mu_2$ is somewhat arbitrary. A variation
of $\mu_2$ would only alter the misclassification level, not the qualitative
conclusion. The third mean contains two variables $x$ and $y$. This two dimensional
surface can be conveniently plotted. Once again because of the
symmetry of the grouping, it is enough to regard the positive quadrant. We
take the range $0 < x < 2.5$ and $0 < y < 2.5$.
Figure 1: Misclassification rate classical procedure.

Figure 2: Misclassification rate optimized procedure.

Figures 1 and 2 show the misclassification rates of the classical and the optimized
procedure, respectively. Note the different scales of the two graphs.
The results of the two procedures are qualitatively similar in the "front"
range ($0 < x < 2.5$ and $0 < y < 1.7$), whereas there is a significant difference
in the "back". We now analyze the reason for the "mountain ridge" in the
classical case in more detail. To achieve this goal, we calculate $S_B$.
A special situation arises if the means of the three groups constitute a
regular triangle. For that reason, we reparametrize the third mean:
$\mu_3 = (1 + \delta x, \sqrt{3} + \delta y)'$. Then we have

$S_B = \frac{2}{9} \begin{pmatrix} 3 + \delta x^2 & \sqrt{3}\,\delta x + \delta x\,\delta y \\ \sqrt{3}\,\delta x + \delta x\,\delta y & (\delta y + \sqrt{3})^2 \end{pmatrix}$. (13)

The special case $\delta x = 0$ yields

$a' S_B a = \frac{2}{9} \left( 3 a_1^2 + (\delta y + \sqrt{3})^2 a_2^2 \right)$ (14)

and after maximization we get the following distinction of cases:

$(\delta y + \sqrt{3})^2 > 3 \iff \delta y > 0$ or $\delta y < -2\sqrt{3} \Rightarrow a_1 = 0, a_2 = 1$ (15)

$(\delta y + \sqrt{3})^2 < 3 \iff -2\sqrt{3} < \delta y < 0 \Rightarrow a_1 = 1, a_2 = 0$

$(\delta y + \sqrt{3})^2 = 3$ (at $\delta y = 0$) $\Rightarrow a_1, a_2$ arbitrary

The mean $\mu_3 = (1, \sqrt{3})'$ results in a singularity (projection vector
$a = (a_1, a_2)'$ not defined). But this mean is realized with probability zero by the
empirical mean value and is therefore unimportant. What is important is the
fact that the projection behaviour "turns over" at this value. For $\delta y < 0$,
the projection is onto the x-axis (as with the optimized procedure), then
onto the y-axis. This causes a higher misclassification rate compared to the
optimized procedure, because the projected first group coincides with the
second one, while the optimized method still projects onto the x-axis.
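The turn-over of the classical projection at $\delta y = 0$ can be checked numerically. The sketch below assumes equal priors and identity within-group covariance (so that the classical direction is the leading eigenvector of $S_B$); the function name `classical_direction` is hypothetical.

```python
import numpy as np

def classical_direction(delta_y, delta_x=0.0):
    """Leading eigenvector of S_B for means (0,0), (2,0), (1+dx, sqrt(3)+dy)."""
    means = np.array([[0.0, 0.0],
                      [2.0, 0.0],
                      [1.0 + delta_x, np.sqrt(3.0) + delta_y]])
    centered = means - means.mean(axis=0)
    S_B = centered.T @ centered / 3.0          # equal priors p(i) = 1/3
    eigvals, eigvecs = np.linalg.eigh(S_B)     # eigenvalues in ascending order
    return eigvecs[:, -1]

above = classical_direction(0.5)    # third mean slightly above the regular triangle
below = classical_direction(-0.5)   # third mean slightly below it
```

For the value above the regular triangle the direction is the y-axis, for the value below it is the x-axis, in line with the case distinction in the text.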

The classical approach fails even more often for more than $g = 3$ groups,
because there are more critical groupings.

4.2 Unequal Covariance Matrices and g = 3 Groups

The central formula (7)

$\lambda_2 := a' M M' a = \sum_{i=2}^{g} (a'(\mu_i - \mu_1))^2 + \sum_{i=2}^{g} \sum_{j=1}^{d} (a'(\Sigma_i^j - \Sigma_1^j))^2$ (16)

yields more possibilities for dominant terms than (4).

For example, assume for one group $i$,



(17) 
Figure 3: Misclassification rate classical procedure.



then only the $j$th column in the $i$th group differs substantially from its
counterpart in the first group. Thus we get

$\lambda_2 \approx (a'(\Sigma_i^j - \Sigma_1^j))^2$. (18)

The Schwarz inequality yields the maximum at

$a \propto (\Sigma_i^j - \Sigma_1^j)$. (19)

This solution uses only a small part of the available information and we
therefore get, once again, a difference between the optimized and classical
solution.

A small simulation study in two dimensions demonstrates the key issue. The
means are now fixed at

$\mu_1 = (0, 0)'$, $\mu_2 = (2, 0)'$ and $\mu_3 = (1, 1)'$. (20)

The covariance matrices are

$\Sigma_1 = I_2$, $\Sigma_2 = I_2$ and $\Sigma_3 = \mathrm{diag}(1 + x, 1 + y)$. (21)

We take again the range $0 < x < 2.5$ and $0 < y < 2.5$. This time, the
graphical representation does not display the whole grouping, but in a more
abstract manner the variance of the third group. Figures 3 and 4 plot
the misclassification rate of the classical and optimized procedure.

The differences are significant, especially if $y$ is large and $x$ small. In this
case, the classical method projects onto the y-axis and the first and second
group collapse. The optimized procedure still projects onto the x-axis.
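The behaviour of the method of Young et al. (1987) in this setting can be reproduced with a short sketch that builds the matrix $M$ of (6) and takes the leading eigenvector of $MM'$. The function name `young_direction` and the particular values of $x$ and $y$ are illustrative assumptions.

```python
import numpy as np

def young_direction(means, covs):
    """Leading eigenvector of M M' with M built by juxtaposition as in (6)."""
    means = [np.asarray(m, dtype=float) for m in means]
    covs = [np.asarray(c, dtype=float) for c in covs]
    mean_part = [(m - means[0])[:, None] for m in means[1:]]   # columns mu_i - mu_1
    cov_part = [c - covs[0] for c in covs[1:]]                 # blocks Sigma_i - Sigma_1
    M = np.hstack(mean_part + cov_part)                        # d x (g-1)(d+1)
    eigvals, eigvecs = np.linalg.eigh(M @ M.T)                 # ascending eigenvalues
    return eigvecs[:, -1], M

x, y = 0.1, 2.5   # small x, large y
means = [[0.0, 0.0], [2.0, 0.0], [1.0, 1.0]]
covs = [np.eye(2), np.eye(2), np.diag([1.0 + x, 1.0 + y])]
a, M = young_direction(means, covs)
```

With a large variance of the third group in the y-direction, the covariance portion dominates and the resulting direction is mainly along the y-axis, which is the situation in which the first and second group collapse.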

In a concrete application, it is useful to compare the classical procedure with
the optimized method in one dimension. If the results differ significantly,
we have to use the optimized approach in higher dimensions (even if the
computational burden is higher), otherwise we use the idea of Young et al.
(1987), if the covariance matrices are unequal (especially if $d > 2$).

Figure 4: Misclassification rate optimized procedure.



5 Conclusions 

After we introduced the classical discriminant analysis based on scatter ma- 
trices, we discussed a less well-known approach of Young et al. (1987) which 
we have reformulated using a distance measure. 

These classical procedures were compared to an optimized procedure based 
on simulated annealing by means of simulations and analytical approxima- 
tions. The differences and drawbacks of the classical approach were dis- 
cussed in detail. The differences for more than two groups can be severe. It 
is exactly this case that is mainly ignored in the literature. 

This article clearly demonstrates the power of computer-intensive methods.
They help the statistician to concentrate on the real problem at hand: here 
the minimization of the misclassification rate. 
Abstract: Noisy data recorded from ion channels can be adequately modelled by
hidden Markov models with a finite number of states. We address the problem 
of testing for the number of hidden states by means of the likelihood ratio test. 
Under the null hypothesis some parameters are on the boundary of the parameter 
space or some parameters are only identifiable under the alternative, and there- 
fore the likelihood ratio tests have to be applied under nonstandard conditions. 
The exact asymptotic distribution of the likelihood ratio statistic cannot be de- 
rived analytically. Thus, we investigate its asymptotic distribution by simulation 
studies. We apply these tests to data recorded from potassium channels. 



1 Introduction 

Hidden Markov models are a powerful tool to model data whenever there 
is an unobservable discrete state dynamics which governs the statistical 
properties of an observable, see Elliott et al. (1995) and MacDonald, Zuc- 
chini (1997) for recent reviews. 

These methods are applied to ion channel data. Ion channels are macro- 
molecular pores in cell membranes. Typically, an ion channel switches 
rapidly between several geometrical conformations but has only two conduc- 
tance levels, i.e. open and closed. The dynamics of conformational changes is 
believed to be Markovian. Conformational changes, however, cannot be ob- 
served directly, but it is possible to measure the current through a single ion 
channel with the so-called patch-clamp technique (Sakmann, Neher (1995)). 
The ion current is of order $10^{-12}$ Ampere and is covered by observational
noise with a signal-to-noise ratio down to 1/2.

The conventional analysis of such data consists in low pass filtering the data 
until the different conductance levels are visible. By using a threshold to 
discriminate between the open and the closed conductance level, the fil- 
tered data are reduced to a binary valued time series, the so-called idealized 
record. This idealized record can be treated as an aggregated Markov process 
(Colquhoun, Sigworth (1983)). Based on the idealized record the following 
analysis procedure is commonly applied: In the first step, open and closed 
dwell time histograms are obtained. Due to the Markovian dynamics, the 
dwell time distributions are a mixture of exponential distributions and the 







number of closed and open states can be inferred from the number of com- 
ponents in the mixture distribution. Based on this information about the 
number of states, aggregated Markov models are fitted to the data in the 
second step, allowing for an analysis of the dynamics (Horn, Lange (1983)). 
The low pass filtering does not only reduce the noise, but also disturbs the 
fast background dynamics of the process heavily, resulting in the so-called 
“missed event” problem (Blatz, Magleby (1986)). 

Therefore, hidden Markov models have been introduced to incorporate the 
observational noise into the model (Chung et al. (1990), Chung et al. (1991)). 
In the framework of hidden Markov models it is not possible to infer the 
number of states by means of dwell time histograms of low pass filtered 
data. Different numbers of states correspond to different model classes and 
for this reason we formulate this problem as a statistical test. 

In hidden Markov models parameters are estimated by the maximum like- 
lihood method and, thus, we apply likelihood ratio tests to decide between 
different model classes. In the present case, however, these tests have to be 
applied under nonstandard conditions. First, some parameters are on the 
boundary of the parameter space under the null hypothesis. Second, some 
parameters may only be identifiable under the alternative. 

We investigate the distribution of the likelihood ratio under these nonstandard
conditions by simulations and evaluate the power of the test. Finally,
we apply the proposed procedure to data recorded from ATP-sensitive K$^+$
channels.



2 Theory 

A hidden Markov model is composed of a Markov chain $X_t$ whose states are
unobservable, and an observed process $Y_t$ such that $(Y_1, \ldots, Y_T)$ is conditionally
independent given $(X_1, \ldots, X_T)$. We consider a discrete time Markov
chain with a finite number of states. These states correspond to the geometrical
conformations of the ion channel protein. In the present case we do not
allow for an arbitrary transition probability matrix $A$ to describe the hidden
dynamics, but only for a transition probability matrix $A$ which is obtained
by exponentiation of a generator matrix $Q$:

$A = \exp(Q \Delta t)$ (1)

where $\Delta t$ is the sampling time and $Q = (q_{ij})$ satisfies the ordinary constraints
for generators: $\sum_j q_{ij} = 0$, $q_{ij} \geq 0$ for $i \neq j$. Equation (1) is a natural
parameterization of the hidden dynamics of an ion channel, because the
physics of the switching between the several conformations of the channel
is described by transition rates (Albertsen, Hansen (1994)).
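Equation (1) can be illustrated with a small numerical sketch. The matrix exponential is computed here by scaling and squaring with a truncated Taylor series so that the example depends on NumPy only; the function name `transition_matrix`, the three-state linear chain and its rates are hypothetical, chosen only to mimic a 25 kHz sampling rate.

```python
import numpy as np

def transition_matrix(Q, dt, n_terms=20):
    """A = exp(Q * dt) via scaling and squaring of a truncated Taylor series."""
    Q = np.asarray(Q, dtype=float)
    X = Q * dt
    norm = np.abs(X).sum(axis=1).max()
    # Halve X k times so that the series converges quickly.
    k = max(0, int(np.ceil(np.log2(norm))) + 1) if norm > 0 else 0
    X = X / 2 ** k
    A = np.eye(len(Q))
    term = np.eye(len(Q))
    for n in range(1, n_terms + 1):
        term = term @ X / n          # X^n / n!
        A = A + term
    for _ in range(k):               # undo the scaling: exp(X) = exp(X / 2^k)^(2^k)
        A = A @ A
    return A

# Hypothetical linear chain C2 <-> C1 <-> O with rates in 1/s; the direct
# transition C2 <-> O is disallowed by zero entries in Q.
Q = np.array([[-50.0, 50.0, 0.0],
              [100.0, -400.0, 300.0],
              [0.0, 200.0, -200.0]])
A = transition_matrix(Q, dt=1.0 / 25000.0)
```

Each row of the resulting matrix sums to one, and entries that are zero in $Q$ become small but positive in $A$, because two allowed transitions can occur within one sampling interval.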

Due to the geometry of the macro protein, an ion channel cannot switch 
from every conformation to every other conformation, but has to follow 
certain transition paths. Thus, the background dynamics of an ion channel 
is characterized by the allowed transitions between the states. Disallowing 
certain transitions can be achieved by setting the corresponding entries in
the generator matrix Q equal to zero. 

In each state an ion channel is either open, so that ions can pass the channel,
or closed, so that the channel blocks the ion flow. Therefore, we label the background
states with C and O for a closed and an open state, respectively. The
distributions of the observed process $Y_t$ conditioned on the background state
$X_t$ are modelled by Gaussians with mean $\mu_C$ and variance $\sigma_C^2$ for a closed
state and mean $\mu_O$ and variance $\sigma_O^2$ for an open state.

For a given class of hidden Markov models parameters are estimated by the 
maximum likelihood method, which is discussed in detail in Fredkin, Rice 
(1992), Albertsen, Hansen (1994) and Michalek, Timmer (1997). 
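The likelihood that is maximized can be evaluated with the standard scaled forward recursion. The following sketch is a generic textbook version for Gaussian observation densities, not the implementation of the cited references; the function name `hmm_loglik` and its argument layout are assumptions.

```python
import numpy as np

def hmm_loglik(y, A, means, sds, init):
    """Log-likelihood of observations y under a Gaussian hidden Markov model."""
    y = np.asarray(y, dtype=float)
    means = np.asarray(means, dtype=float)
    sds = np.asarray(sds, dtype=float)
    # b[t, k]: Gaussian density of y_t given hidden state k.
    b = np.exp(-0.5 * ((y[:, None] - means) / sds) ** 2) / (sds * np.sqrt(2 * np.pi))
    alpha = np.asarray(init, dtype=float) * b[0]
    loglik = 0.0
    for t in range(len(y)):
        if t > 0:
            alpha = (alpha @ A) * b[t]   # propagate one step, then weight by emission
        c = alpha.sum()                  # scaling factor avoids numerical underflow
        loglik += np.log(c)
        alpha = alpha / c
    return loglik

# Two hidden states with identical observation densities: the likelihood
# must then reduce to that of independent Gaussian observations.
y = [0.1, -0.3, 0.5]
A = np.array([[0.9, 0.1], [0.2, 0.8]])
ll = hmm_loglik(y, A, [0.0, 0.0], [1.0, 1.0], [0.5, 0.5])
```

For closed and open states, `means` and `sds` would hold $\mu_C$, $\sigma_C$ for every closed state and $\mu_O$, $\sigma_O$ for every open state.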
Physiological background knowledge can often be used to reduce the number 
of possible model classes for a given ion channel. In the simplest case it is 
possible to reduce the problem to two nested model classes, $\Gamma \subset \Theta$, with
different numbers of states. Thus, we formulate the following hypotheses:

$H_0$: true model $\in \Gamma$

$H_1$: true model $\in \Theta \setminus \Gamma$ (2)

Since the nesting of model classes is achieved by fixing one or more entries 
in the generator matrix Q to zero and transition rates are required to be 
nonnegative, the model class with fewer states is always part of the boundary 
of the parameter space. Besides, it is possible that the likelihood function 
does not depend on some transition rates under the null hypothesis, so that
these transition rates are only identifiable under the alternative. An example
is given in Fig. 1.

In Sec. 2.1 we discuss the likelihood ratio test under these circumstances. 




Figure 1: Example of two model classes of hidden Markov models. The
arrows indicate the allowed transitions. $\Gamma$ is nested in $\Theta$ by requiring $b = 0$.
Furthermore, the rate $a$ is only identifiable if $b > 0$.



2.1 Likelihood Ratio Tests 

The classical result that the twofold log-likelihood ratio is distributed asymptotically
as $\chi^2$ with the number of degrees of freedom given by the difference
of the numbers of parameters of the model classes depends in particular on the
following five assumptions (Cox, Hinkley (1974), Vuong (1989)):



• the model classes are nested, 
• the model classes are not misspecified,

• the maximum likelihood estimators are asymptotically normally dis- 
tributed, 

• the true parameters are not part of the boundary of the parameter 
space, 

• all nuisance parameters are identifiable under the null hypothesis. 

In our case the last two assumptions are not fulfilled and the likelihood ratio 
tests are applied under nonstandard conditions. 

Parameters on the boundary under the null hypothesis. For iid
random variables a general characterization of the asymptotic distribution
of the twofold log-likelihood ratio was derived by Self, Liang (1987). They
obtain analytical results in special cases. In particular, if only one parameter
is restricted to be on the boundary of the parameter space under the
null hypothesis, the twofold log-likelihood ratio asymptotically follows the
distribution

$\frac{1}{2}\chi^2_1 + \frac{1}{2}\chi^2_0$ (3)

instead of the classical result $\chi^2_1$.

Non-identifiable parameters under the null hypothesis. For this case
analytical results are not known. This is mainly due to the fact that the 
log-likelihood can no longer be approximated by a parabolic function in 
the neighborhood of the true parameter value. Hansen (1992) proposed a 
procedure to obtain upper bounds for the quantiles of the log-likelihood 
statistic for iid random variables based on the covariance function of the 
standardized likelihood ratio process. Since in the case of hidden Markov 
models there is no estimator known for this covariance function, this result 
is not easily extendible to hidden Markov models. Therefore, we obtain an 
approximation of the distribution of the log-likelihood ratio by a parametric 
bootstrap procedure (Davison, Hinkley (1997) and Efron, Tibshirani (1993)): 
Given the data, we estimate the parameters under the null hypothesis and 
draw new independent time series of the same length as the real data set from 
the parametric estimate of the distribution of the hidden Markov process. 
We derive an approximation of the distribution of the log-likelihood ratio by 
calculating the test statistic for each simulated time series. 
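The bootstrap steps just described can be sketched generically. Because fitting hidden Markov models is beyond a short example, the sketch uses a toy stand-in with a boundary parameter: independent $N(\mu, 1)$ observations with $H_0$: $\mu = 0$ against $H_1$: $\mu > 0$, for which the twofold log-likelihood ratio is $n \max(\bar{y}, 0)^2$. All function names are hypothetical.

```python
import numpy as np

def bootstrap_lr(data, fit_null, simulate, lr_statistic, n_boot=2000, seed=0):
    """Parametric bootstrap distribution of a likelihood ratio statistic under H0."""
    rng = np.random.default_rng(seed)
    theta0 = fit_null(data)                         # estimate parameters under H0
    return np.array([lr_statistic(simulate(theta0, len(data), rng))
                     for _ in range(n_boot)])       # statistic for each simulated series

# Toy stand-in for the hidden Markov model.
def fit_null(y):
    return 0.0                                      # under H0 the mean is fixed at zero

def simulate(mu, n, rng):
    return mu + rng.standard_normal(n)              # new series of the same length

def lr_statistic(y):
    return len(y) * max(np.mean(y), 0.0) ** 2       # twofold log-likelihood ratio

observed = np.random.default_rng(1).standard_normal(200)
boot = bootstrap_lr(observed, fit_null, simulate, lr_statistic)
critical = np.quantile(boot, 0.95)                  # bootstrap 5% critical value
```

In this toy case about half of the simulated statistics are exactly zero, which is the point mass $\chi^2_0$ of the mixture (3). For the ion channel problem, `fit_null`, `simulate` and `lr_statistic` would be replaced by the maximum likelihood fit, the simulation and the likelihood ratio of the hidden Markov models.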



3 Simulations 

In this section we investigate the asymptotic distribution of the log-likelihood 
ratio statistic under the null hypothesis and the power of the test by simu- 
lations. 
Figure 2: Distribution function and q-q plot. The distribution function of
the twofold log-likelihood ratio, obtained by simulating 2299 time series,
is plotted as a solid line in comparison with the $\chi^2_2$ distribution (dashed
line) and with the mixture distribution $\frac{1}{2}\chi^2_2 + \frac{1}{2}\chi^2_0$ (dotted line). The q-q
plot indicates that the simulated distribution is consistent with the mixture
distribution $\frac{1}{2}\chi^2_2 + \frac{1}{2}\chi^2_0$.



As an example we assume a true model with one open state and two closed
states. This example is motivated by the application to real data that we present
in Sec. 4. The model parameters correspond to the estimated parameters
under the null hypothesis of the real data of Sec. 4 and the time series in
all simulation studies had the same length as the measured data set, 131072
data points corresponding to a measurement of $\approx 5$ s with 25 kHz sampling
rate. The ran1 pseudo-random number generator was used in all simulation
studies (Press et al. (1992)). The number of calls to ran1 never exceeded
0.2% of the order of the pseudo-random number generator’s period.

The hypotheses are formulated as follows (see Fig. 1):

$H_0$: $b = 0$

$H_1$: $b > 0$ (4)

Under the null hypothesis the transition rate $b$ is part of the boundary of
the parameter space and the rate constant $a$ is not identifiable.

The simulation study indicates that under the null hypothesis the twofold
log-likelihood ratio statistic is distributed asymptotically as $\frac{1}{2}\chi^2_2 + \frac{1}{2}\chi^2_0$ (see
Fig. 2). The 95% quantile is approximately 4.58.
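As a plausibility check of the reported quantile, the quantiles of an equal mixture of a $\chi^2$ distribution and a point mass at zero have closed forms for one and two degrees of freedom. The sketch assumes that the mixture is exactly $\frac{1}{2}\chi^2_{df} + \frac{1}{2}\chi^2_0$; the function name `mixture_quantile` is hypothetical.

```python
import math
from statistics import NormalDist

def mixture_quantile(level, df):
    """Quantile of 0.5 * chi2_df + 0.5 * chi2_0 for df in {1, 2}."""
    if level <= 0.5:
        return 0.0                       # half of the probability mass sits at zero
    p = 2.0 * level - 1.0                # level required of the chi2_df component
    if df == 1:
        # chi2_1 is the square of a standard normal variable.
        return NormalDist().inv_cdf(0.5 * (1.0 + p)) ** 2
    if df == 2:
        # chi2_2 is exponential with mean 2: F(q) = 1 - exp(-q / 2).
        return -2.0 * math.log(1.0 - p)
    raise ValueError("only df = 1 or df = 2 are handled in this sketch")

q2 = mixture_quantile(0.95, 2)   # two degrees of freedom
q1 = mixture_quantile(0.95, 1)   # one degree of freedom, as in (3)
```

With two degrees of freedom the 95% quantile is $-2 \ln 0.1 \approx 4.61$, close to the simulated value of 4.58, whereas the mixture (3) with one degree of freedom would give about 2.71.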

The null hypothesis is violated if the true model has two open states instead
of one, corresponding to a transition rate $b > 0$. A further simulation study
investigates the power of the likelihood ratio test for the given length of
the data set against alternatives $b > 0$. Data sets were simulated with
$b$ = 8, 16, 33, 66, 133 s$^{-1}$ and models under the null hypothesis and the
alternative were fitted to these data sets. Based on these fitted models, the
power of the test was evaluated for a test at the 5% level and plotted in Fig. 3.
Figure 3: The power function of a test at the 5% level against alternatives $b > 0$
for the given length of 131072 data points is shown. For each value of $b$
the calculation of the power was based on 1299 replications.
Figure 4: K$^+$ currents through cardiac ATP-sensitive potassium channels.
The currents are measured in arbitrary units of the AD-converter. The data 
set has a length of 131072 data points, the sampling rate was 25 kHz. In 
the upper level the ion channel is closed. A subset from 0.7 s to 0.95 s of the 
whole data set is shown on the right. 



4 Application to Measured Data 

The measured data are currents through cardiac ATP-sensitive potas- 
sium channels (Haverkamp et al. (1995)) (see Fig. 4). Due to physiological 
considerations the true model should have two closed states and one or two 
open states arranged in a linear chain (see Fig. 1). For this case the hypotheses
are formulated in equation (4). By means of maximum likelihood
estimation two models, under the null hypothesis and under the alternative,
respectively, were fitted and the twofold log-likelihood ratio was obtained:

$2(L_{O_1 O_2 C_1 C_2} - L_{O C_1 C_2}) = 80 \gg 4.58$ (5)

Applying our result from Sec. 3, the null hypothesis that there is only one
open state has to be rejected.



5 Discussion 

We discussed a special family of hidden Markov models which is suitable to 
describe ion channel data, and we have shown that the likelihood ratio test 
for testing for the number of states has to be applied under nonstandard 
conditions. 

In the physiological literature, likelihood ratio tests are used to select
between different model classes, mostly in the case of aggregated Markov
models (Horn et al. (1984), Kienker (1990)), without taking into account that
the likelihood ratio statistic does not follow a χ² distribution with an
appropriate number of degrees of freedom under nonstandard conditions. As a
consequence, the test is too conservative and thus loses its power to detect
violations of the null hypothesis.

However, no general theory is known so far for the asymptotic distribution
of the log-likelihood ratio statistic under these nonstandard conditions for
hidden Markov models. Therefore, we proposed a parametric bootstrap
procedure to obtain the correct quantiles of the distribution of the
twofold log-likelihood ratio statistic.
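
The bootstrap idea can be illustrated with a minimal sketch. It does not
use the hidden Markov models of this paper; as a stand-in it takes a toy
problem that is also nonstandard, namely testing mu = 0 against mu >= 0 for
Gaussian data with known unit variance, where the null value lies on the
boundary of the parameter space. All function names are hypothetical. Data
are simulated repeatedly under the null model, the twofold log-likelihood
ratio is recomputed for each replicate, and the empirical 95% quantile is
taken as the critical value.

```python
import random


def bootstrap_lrt_quantile(simulate_null, loglik_null, loglik_alt,
                           n_rep=2000, level=0.95, seed=1):
    """Empirical quantile of 2*(logL_alt - logL_null) under the null model."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_rep):
        data = simulate_null(rng)          # draw a replicate from the null model
        stats.append(2.0 * (loglik_alt(data) - loglik_null(data)))
    stats.sort()
    return stats[min(n_rep - 1, int(level * n_rep))]


N = 200  # length of each simulated series (assumption for the toy example)


def simulate_null(rng):
    return [rng.gauss(0.0, 1.0) for _ in range(N)]


def loglik(data, mu):
    # Gaussian log-likelihood with unit variance, up to an additive constant
    return -0.5 * sum((y - mu) ** 2 for y in data)


def loglik_null(data):
    return loglik(data, 0.0)


def loglik_alt(data):
    # maximum likelihood estimate under the constraint mu >= 0
    mu_hat = max(0.0, sum(data) / len(data))
    return loglik(data, mu_hat)


critical = bootstrap_lrt_quantile(simulate_null, loglik_null, loglik_alt)
print(critical)
```

In this toy case the bootstrap critical value falls clearly below the
standard χ² value with one degree of freedom (about 3.84), which shows why
using the standard quantile makes the test too conservative.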

Acknowledgment 

We thank Prof. M. Kohlhardt, Physiologisches Institut, Universität Freiburg,
for kindly providing the ion channel data.



References 

ALBERTSEN, A. and HANSEN, U.-P. (1994): Estimation of kinetic rate con- 
stants from multi-channel recordings by a direct fit of the time series. Biophysical 
Journal, 67, 1393-1403. 

BLATZ, A.L. and MAGLEBY, K.L. (1986): Correcting single channel data for 
missed events. Biophysical Journal, 49, 967-980. 

CHUNG, S.-H., MOORE, J.B., XIA, L., PREMKUMAR, L.S. and GAGE, P.W. 
(1990): Characterization of single channel currents using digital signal processing 
techniques based on hidden Markov models. Philosophical Transactions of the 
Royal Society London B, 329, 265-285. 

CHUNG, S.-H., KRISHNAMURTHY, V. and MOORE, J.B. (1991): Adaptive
processing techniques based on hidden Markov models for characterizing very
small channel currents buried in noise and deterministic interferences.
Philosophical Transactions of the Royal Society London B, 334, 357-384.







COLQUHOUN, D. and SIGWORTH, F. J. (1983): Fitting and statistical analysis
of single-channel records. In: Sakmann, B., Neher, E. (Eds.): Single-Channel
Recording, 483-585, Plenum Press, New York, London.

COX, D.R. and HINKLEY, D.V. (1974): Theoretical Statistics. Chapman & Hall,
London.

DAVISON, A.C. and HINKLEY, D.V. (1997): Bootstrap Methods and their Ap- 
plication, Cambridge Series in Statistical and Probabilistic Mathematics, Cam- 
bridge University Press. 

EFRON, B. and TIBSHIRANI, R.J. (1993): An Introduction to the Bootstrap. 
Monographs on Statistics and Applied Probability, 57, Chapman & Hall, London. 

ELLIOTT, R.J., AGGOUN, L. and MOORE, J.B. (1995): Hidden Markov Models:
Estimation and Control. Springer-Verlag, New York, Berlin, Heidelberg.

FREDKIN, D.R. and RICE, J.A. (1992): Maximum likelihood estimation and 
identification directly from single-channel recordings. Proceedings of the Royal 
Society London B, 249, 125-132. 

HANSEN, B.E. (1992): The likelihood ratio test under nonstandard conditions: 
testing the Markov switching model of GNP, Journal of applied econometrics, 7, 
61-82. 

HAVERKAMP, K., BENZ, I. and KOHLHARDT, M. (1995): Thermodynamically
specific gating kinetics of cardiac mammalian K⁺(ATP)-channels in a
physiological environment near 37°C, Journal of Membrane Biology, 146, 85-90.

HORN, R. and LANGE, K. (1983): Estimating kinetic constants from single 
channel data. Biophysical Journal, 43, 207-223. 

HORN, R., VANDENBERG, C. and LANGE, K. (1984): Statistical analysis of 
single sodium channels. Biophysical Journal, 45, 323-335. 

KIENKER, P. (1990): Equivalence of aggregated Markov models of ion-channel 
gating. Proceedings of the Royal Society London B, 236, 269-309. 

MACDONALD, I.L. and ZUCCHINI, W. (1997): Hidden Markov Models and 
Other Models for Discrete-valued Time Series, Chapman & Hall, London. 

MICHALEK, S. and TIMMER, J. (1997): Estimating rate constants in hidden 
Markov models by the EM algorithm, accepted in IEEE Transactions on Signal 
Processing. 

PRESS, W.H., TEUKOLSKY, S.A., VETTERLING, W.T. and FLANNERY, 
B.P. (1992): Numerical Recipes in FORTRAN, Second Edition, Cambridge Uni- 
versity Press. 

SAKMANN, B. and NEHER, E. (1995): Single-channel recording. Plenum Press,
New York, London.

SELF, S. G. and LIANG, K. Y. (1987): Asymptotic properties of maximum likeli- 
hood estimators and likelihood ratio tests under nonstandard conditions. Journal 
of the American Statistical Association, 82(398), 605-610. 

VUONG, Q.H. (1989): Likelihood ratio tests for model selection and non-nested 
hypotheses, Econometrica, 57(2), 307-333. 




ClustanGraphics3

Interactive Graphics for Cluster Analysis 

David Wishart 

Department of Management, University of St. Andrews 
St. Katharine’s West, The Scores, St. Andrews, Fife KY16 9AL, Scotland 

Email: D.Wishart@St-Andrews.ac.uk Website: www.clustan.com 
Tel: +44-131-337-1448 

Abstract: ClustanGraphics3 is a new interactive program for hierarchical clus- 
ter analysis. It can display shaded representations of proximity matrices, dendro- 
grams and scatterplots for 11 clustering methods, with an intuitive user interface 
and new optimization features. Algorithms are proposed which optimize the rank 
correlation of the proximity matrix by seriation, compute cluster exemplars and 
truncate a large dendrogram and proximity matrix. ClustanGraphics3 is illus- 
trated by a market segmentation study for automobiles and a taxonomy of 20 
species based on the amino acids in their protein cytochrome-c molecules. The 
paper concludes with an overview. 



1 Hierarchical Cluster Analysis 

The paper develops work first reported by Wishart (1997), which was influ- 
enced by Gale et al. (1984), and represents a modernization and development 
of Clustan (Wishart (1987)) for Windows 95, 98 and NT.

Figure 1: Shaded representation of a proximity matrix for 9 cases

A proximity matrix which is read or computed by ClustanGraphics3 can be
displayed in shaded form, as in Fig. 1, which illustrates a distance matrix
for 9 automobiles. Heavy shading is used to denote high similarities, or
small distances. The proximity matrix is usually symmetric, since
$s_{ik} = s_{ki}$ usually holds, where $s_{ik}$ is the proximity of case i
with respect to case k.
The matrix is next clustered hierarchically to obtain a dendrogram which 
can be displayed in shaded form, as in Fig. 2 (see Gordon (1996) for a recent 
review of hierarchical cluster analysis). Two colour shading is used to differ- 
entiate within-cluster and between-cluster groupings, as in this classification 
of 3 clusters which was selected by pointing and clicking the mouse along 
the 3-cluster partition. ClustanGraphics3 allows dendrograms to be easily 
formatted, re-sized, truncated or pruned, and images can be magnified or 
compressed - essential features when analysing thousands of cases. 




Figure 2: Dendrogram for 9 cases, obtained using Ward’s Method 



2 Seriation Algorithm 

The order of the cases in a dendrogram is quite arbitrary, being dependent 
upon the order in which the cases are presented to the clustering program. 
There are up to $2^{n-2}$ ways of arranging any dendrogram of n cases. Even
for this small example, involving only 9 cases, there are 128 different ways
of displaying the dendrogram by reversing the order of the cases at each of
7 fusion steps (reversing the order of the 8th fusion step merely inverts the
dendrogram).

The graphical presentation of a dendrogram and its underlying proximity 
matrix can generally be improved by seriation, or optimal re-ordering of 
the cases. Some seriation methods have been considered by Gale et al. 
(1984). A suitable seriation method should have certain desirable features. 
Firstly, it should work equally well for similarities and dissimilarities. It 
should also produce the same results for the same dendrogram, even when 
the data or proximities are subjected to linear transformations of the form: 

$x'_{ij} = a\,x_{ij} + b$

$s'_{ik} = a\,s_{ik} + b$

where $x_{ij}$ are the observations used to compute proximities, $s_{ik}$ are
the computed or observed proximities, and a and b are transformation factors.
This leads us to propose optimizing the rank order of the proximities, 
because the ranks are invariant under linear transformations of the data or 
proximities. Our proposed algorithm seeks an optimal re-arrangement of the 
proximity matrix such that the rank order of the proximities by rows is as 
close as possible to the perfect rank ordering, as follows: 

0 1 2 3 4 5 ...
1 0 1 2 3 4 ...
2 1 0 1 2 3 ...
3 2 1 0 1 2 ...
4 3 2 1 0 1 ...

We first replace the proximities by their row-wise ranks $a_{ik}$ and compute
a variant of Spearman’s ρ, the rank correlation between the actual row-wise
ranks $a_{ik}$ of the proximity matrix and the perfect ranks $p_{ik}$, as
follows:






where $a_{ik}$ is the actual rank of the proximity $s_{ik}$ and $p_{ik}$ is
the target rank in a perfectly ordered proximity matrix. The seriation
objective is therefore to maximize the correlation ρ.

Upon expanding eqn. (1) the terms involving n, $\sum_i \sum_k a_{ik}^2$ and
$\sum_i \sum_k p_{ik}^2$ are additive or multiplicative constants for a
proximity matrix in any order, hence
$\rho = c_1 - c_2 \sum_i \sum_k a_{ik} p_{ik}$ where $c_1$, $c_2$ are
constants for a given proximity matrix. Maximum ρ is therefore achieved by
minimizing the function $f = \sum_i \sum_k a_{ik} p_{ik}$. The algorithm to
maximize ρ proceeds as follows:

1. Order the proximities row-wise to obtain the matrix of their ranks
$a_{ik}$ and compute the objective function
$f = \sum_i \sum_k a_{ik} p_{ik}$.

2. For each of the first n-2 fusion steps, reverse the order of the cases in
the resulting cluster.

3. Re-calculate the matrix of ranks $a^*_{ik}$ and compute the value of the
revised objective function $f^* = \sum_i \sum_k a^*_{ik} p_{ik}$.

4. If $f^* < f$ then the new case order increases ρ, so we confirm the new
case order examined at step 2 and adopt $a^*_{ik}$ as the new matrix of
ranks; otherwise retain the original case order and ranks.

5. Repeat steps 2-4 until there are no further reductions in the objective
function $f^*$ in a complete iteration, and hence no further improvements
to the serial order of cases in the dendrogram.

A computational problem with this algorithm is that step 3 requires new
ranks to be calculated at each of the n-2 fusion steps in each iteration,
which is a lot of computational work for a large proximity matrix. We
have therefore considered minimizing $f = \sum_i \sum_k d_{ik}^2 p_{ik}$ as
an alternative objective function (Wishart (1997)). This has the advantage
that the matrix of squared Euclidean distances $d_{ik}^2$ does not have to
be re-computed at step 3, and the optimization algorithm is therefore much
faster.

In practice, we have implemented optimization of both
$\sum_i \sum_k a_{ik} p_{ik}$ and $\sum_i \sum_k d_{ik}^2 p_{ik}$ and have
found no difference in the results. For all the data sets and hierarchical
cluster analyses we have tested, optimizing both functions produced the same
order of the proximity matrix and tree; however, optimizing
$\sum_i \sum_k d_{ik}^2 p_{ik}$ was much faster than optimizing
$\sum_i \sum_k a_{ik} p_{ik}$.
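
Steps 1-5 above can be sketched in a few lines. This is an illustration
under stated assumptions, not the ClustanGraphics3 implementation: the
dendrogram is held as a nested list of case indices, ties in a row are
ranked by closeness to the diagonal, and, because eqn. (1) is not
reproduced here, the sketch minimizes the squared discrepancy between the
row-wise ranks and the perfect ranks |i - k|, which is smallest exactly
when the rank agreement is greatest. All function names are hypothetical.

```python
def leaf_order(node):
    """Left-to-right case order of a nested-list binary tree."""
    if isinstance(node, int):
        return [node]
    return leaf_order(node[0]) + leaf_order(node[1])


def internal_nodes(node, is_root=True):
    """All fusion nodes except the root (flipping the root only inverts the tree)."""
    if isinstance(node, int):
        return []
    found = internal_nodes(node[0], False) + internal_nodes(node[1], False)
    if not is_root:
        found.append(node)
    return found


def rank_discrepancy(dist, order):
    """Sum of squared differences between row-wise ranks and perfect ranks."""
    n = len(order)
    total = 0
    for r, i in enumerate(order):
        row = [dist[i][k] for k in order]
        # rank by dissimilarity; ties go to the column nearer the diagonal
        by_value = sorted(range(n), key=lambda c: (row[c], abs(c - r)))
        for rank, c in enumerate(by_value):
            total += (rank - abs(r - c)) ** 2
    return total


def seriate(tree, dist):
    """Flip sub-clusters in place, keeping every flip that lowers the objective."""
    best = rank_discrepancy(dist, leaf_order(tree))
    improved = True
    while improved:                      # repeat until a full pass changes nothing
        improved = False
        for node in internal_nodes(tree):
            node.reverse()               # reverse the cases of this cluster
            trial = rank_discrepancy(dist, leaf_order(tree))
            if trial < best:
                best, improved = trial, True
            else:
                node.reverse()           # undo the flip
    return leaf_order(tree), best


# Four cases lying on a line at positions 0, 1, 2, 3 (toy data)
dist = [[abs(a - b) for b in range(4)] for a in range(4)]
tree = [[1, 0], [2, 3]]
start = rank_discrepancy(dist, leaf_order(tree))
order, best = seriate(tree, dist)
print(order, start, best)
```

Each accepted flip strictly lowers the objective, so the loop terminates.
Like the procedure described above, it is a local search and need not reach
the best of all admissible orders.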

Figure 3: Proximity matrix optimally re-ordered by seriation



For the example in Fig. 1, the seriation algorithm to optimize $\sum_i \sum_k d_{ik}^2 p_{ik}$
converged after 3 iterations, resulting in an optimal re-ordering of the prox- 
imity matrix as illustrated in Fig. 3. After seriation, the shaded distance 
matrix exhibits much stronger clustering near the diagonal. In particular, 
the associations Audi, BMW, Volvo, Renault and Vauxhall, Rover, Toy- 
ota are now visually more evident, and the serial case order is more easily 
interpreted. 

The re-ordered tree is shown in Fig. 4, where the 3 cluster partition has 
again been selected for detailed study. The cases underlined are cluster 
exemplars (see below). 



3 Cluster Exemplars 

ClustanGraphics3 can also exemplify clusters. This involves finding the
most typical member of each cluster, or its “exemplar”. We have chosen
the maximum within-cluster average similarity or minimum within-cluster
average dissimilarity to define the exemplars. If there is a tie, we have
used the minimum between-cluster average similarity or maximum
between-cluster average dissimilarity to define the exemplars. A tie always
occurs when a cluster comprises 2 cases or has a symmetric arrangement of
more than 2 cases.
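
The exemplar rule can be written down directly for a dissimilarity matrix.
The following is a minimal sketch with hypothetical function names and toy
data: the exemplar is the case with the smallest average dissimilarity to
the other members of its cluster, and a tie is broken in favour of the case
with the largest average dissimilarity to the cases outside the cluster.

```python
def average_dissimilarity(case, group, dist):
    """Mean dissimilarity between `case` and the other cases in `group`."""
    others = [k for k in group if k != case]
    if not others:
        return 0.0
    return sum(dist[case][k] for k in others) / len(others)


def exemplar(cluster, dist, all_cases):
    """Most typical member of `cluster` under the two-level rule."""
    outside = [k for k in all_cases if k not in cluster]
    return min(
        cluster,
        # first key: within-cluster average (smaller is better);
        # second key: between-cluster average (larger is better).
        # Exact float comparison is enough for this toy data.
        key=lambda i: (average_dissimilarity(i, cluster, dist),
                       -average_dissimilarity(i, outside, dist)),
    )


# Five cases on a line; clusters {0, 1, 2} and {3, 4} (toy data)
positions = [0, 1, 2, 10, 11]
dist = [[abs(a - b) for b in positions] for a in positions]
cases = range(len(positions))

first = exemplar([0, 1, 2], dist, cases)
second = exemplar([3, 4], dist, cases)
print(first, second)
```

The two-case cluster shows the tie described above: both members have the
same within-cluster average, so the between-cluster criterion decides.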

Other measures are possible. For example, it would be interesting to use the 
maximum row-wise rank correlation within a cluster to define exemplars, as 
this would then be consistent with the re-ordering criterion. However, we 
have not yet explored alternative exemplar criteria. In Fig. 4, the exemplars 
Nissan, Vauxhall and BMW for the 3-cluster partition are underlined. 




Figure 4: Dendrogram re-ordered by seriation, exemplars underlined 



4 Dendrogram Truncation 

A large proximity matrix and dendrogram can be truncated by Clustan- 
Graphics3 to the subset of r typical members or exemplars for each cluster. 
This is achieved in ClustanGraphics3 by interrupting the clustering pro- 
cedure after n — r fusions, and saving the intermediate proximity matrix 
representing the appropriate measures of proximity between the r clusters. 
In addition, cluster sizes or cumulative cluster weights must also be retained 
so that the algorithm can be re-started. This is necessary in order that the 
same final results are obtained for q clusters {q < r) if the dendrogram is 
truncated first at r clusters, then re-started and continued to q clusters. 
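
Why the cluster sizes have to be saved can be seen in a small sketch. It
assumes average linkage, for which the proximity of a merged cluster to any
other cluster is the size-weighted mean of the two old proximities; other
methods need their own update rule. The function name and the toy data are
hypothetical. Stopping at r clusters and re-starting from the saved
matrix, sizes and memberships gives the same result as an uninterrupted
run.

```python
def agglomerate(dist, sizes, members, target):
    """Average-linkage fusions until `target` clusters remain.

    Returns the intermediate proximity matrix, cluster sizes and memberships,
    which together are enough to re-start the procedure later.
    """
    dist = [row[:] for row in dist]
    sizes = sizes[:]
    members = [m[:] for m in members]
    while len(sizes) > target:
        n = len(sizes)
        # closest pair (i, j) of current clusters, with i < j
        i, j = min(((a, b) for a in range(n) for b in range(a + 1, n)),
                   key=lambda pair: dist[pair[0]][pair[1]])
        # size-weighted update of the proximities to the merged cluster
        merged = [(sizes[i] * dist[i][k] + sizes[j] * dist[j][k])
                  / (sizes[i] + sizes[j]) for k in range(n)]
        for k in range(n):
            dist[i][k] = dist[k][i] = merged[k]
        dist[i][i] = 0.0
        sizes[i] += sizes[j]
        members[i] = members[i] + members[j]
        # drop cluster j from the matrix and the bookkeeping lists
        del dist[j]
        for row in dist:
            del row[j]
        del sizes[j]
        del members[j]
    return dist, sizes, members


positions = [0, 1, 3, 7, 8, 20]
full = [[abs(a - b) for b in positions] for a in positions]
singletons = [[i] for i in range(len(positions))]
unit_sizes = [1] * len(positions)

direct = agglomerate(full, unit_sizes, singletons, 2)
truncated = agglomerate(full, unit_sizes, singletons, 4)   # stop at r = 4
resumed = agglomerate(*truncated, 2)                       # continue to q = 2
print(direct[2], resumed[2])
```

If the sizes were discarded at the truncation point, the weighted update
could not be computed and the continued run could diverge from the direct
one.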



5 Protein Mutation Matrix 

In this application from the biological sciences, a dissimilarity matrix for
20 species was read by ClustanGraphics3. It is displayed in shaded form
in Fig. 5. The distance between two species is the number of positions in
the protein cytochrome-c molecules where the two species have different
amino acids (Fitch and Margoliash (1967)). Heavier shading reflects smaller
mutation distances and hence more similar proteins. This distance matrix
was reproduced from Hartigan (1975).

The distance matrix was next classified hierarchically using average linkage
to obtain a dendrogram which is in an arbitrary order as generated by the 
cluster analysis, being dependent upon the order of the cases in the data 
matrix. There are over 262,000 different ways of presenting the dendrogram 
without having overlapping unions. 
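
This count agrees with the bound given in Section 2, assuming that each of
the first n-2 fusion steps can be reversed independently, with n = 20:

$$2^{n-2} = 2^{20-2} = 2^{18} = 262{,}144$$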

Figure 5: Distances between protein cytochrome-c molecules for 20 species



The distance matrix was next re-ordered by ClustanGraphics3 using the 
seriation algorithm which converged after 3 iterations, resulting in the re- 
ordered proximity matrix shown in Fig. 6 and the dendrogram in Fig. 7. 
By comparison with Fig. 5, the new order of the proximity matrix in Fig. 
6 displays small distances close to the diagonal and the patterns of cluster 
groupings have become evident. The 20 species are more naturally ordered 
through their classes: fishes (1), reptiles (1), primates (2), other
mammals (6), birds (4), amphibians (1), insects (2) and moulds (3).



6 ClustanGraphics Overview 

ClustanGraphics3 provides useful facilities for creating flexible graphics for
hierarchical cluster analysis which can be easily pasted into reports and
presentations. It is interactive, intuitive and easy to use, exploiting colour
for shading and the graphical user interface provided by Windows 95, 98 or
NT. Its features include functional buttons and a floating toolbar, pull-down
menus, variable fonts, hints and context-sensitive help, online WinHelp, error
checking and parameter defaults.

Figure 6: Re-ordered distance matrix for 20 species



ClustanGraphics3 can read data matrices, proximity matrices, dendrograms
and scatterplot matrices from text files, spreadsheets or some databases.
Dendrograms can also be read from several popular statistical packages such 
as Clustan, SPSS, SAS, S-Plus and BMDP. 

In addition to providing 11 standard methods of hierarchical cluster
analysis on proximity matrices, ClustanGraphics3 can construct dendrograms
directly on data matrices comprising thousands of cases or thousands of
variables. Examples involving 40,000 cases and 40,000 variables have been
successfully analysed using Ward’s Method and Average Linkage on a PC.
ClustanGraphics3 is therefore uniquely scalable for use with large surveys
and in data mining applications (Wishart (1999)).

A hierarchical cluster model can be further refined in ClustanGraphics3
by k-means analysis or outlier deletion. New cases can be compared to a
hierarchical cluster model and the best fitting cluster can be identified by
ClustanGraphics3. This feature has been used to classify over 4 million
cases by reference to a hierarchical cluster model of 30,000 cases.
Significance tests are provided to help identify the most natural partition
of a dendrogram for cluster model specification.

A 44-page ClustanGraphics Primer is available (Wishart (1998)).

Figure 7: Optimally re-ordered dendrogram for 20 species

Abstract: The interpretation of cluster analysis solutions in the case of
object-attribute data can be supported by methods of Formal Concept Analysis,
leading to a conceptual understanding of the “meaning” of clusters,
partitions and dendrograms. The central idea is the embedding of a given
cluster in a conceptual scale which represents the user’s granularity with
respect to the values of attributes in the original data. This method is
demonstrated using data from ALLBUS 1996.



1 Interpretation of Cluster Analysis 
Solutions 

In this article we deal with the interpretation of cluster analysis solutions 
(Jardine, Sibson (1971), Bock (1974)) in the case that the data are given 
as an object-attribute table — formally described as a many-valued context 
(Ganter, Wille (1996)). Such an interpretation should relate the given cluster 
analysis solution with some suitable information given in the original data. 
We are not interested in discussing indices of cluster analysis solutions like 
homogeneity, stability etc. 

The cluster analysis solutions are mainly represented as a hierarchical tree
(drawn as a dendrogram) where the leaves are labeled by the object names.
The “meaning” of this dendrogram has to be explained to the user by
describing some “meaningful” partitions, usually obtained by cuts of the
dendrogram. Which answers satisfy the user? In most cases the user would
like an explanation of the form: “This cluster consists of exactly the
objects having the property P”, where the property is formulated using the
attributes and values of the given many-valued context. Subsets described in
this form occur as extents and contingents of formal contexts derived from
many-valued contexts by conceptual scaling (Ganter, Wille (1996)). This
leads to the possibility of a conceptual understanding of the “meaning” of
clusters, but usually not by an automatic procedure. The central idea
is the joint conceptual representation, by means of a nested line diagram, of
both the given partition and a “suitable view” into the many-valued
context. Such “suitable views” have to be constructed as conceptual scales
representing interesting aspects of the data with respect to the purpose of
the investigation.



1.1 “Objectivity” and the Problem of Interpretation

Contributions to the discussion of the interpretation of cluster solutions are 
widely spread over application sciences, like natural sciences and the human- 
ities concerned with numerical taxonomy, pattern recognition, multivariate 
analysis methods and so on (cf. Dubes, Jain (1979)). 

Many researchers are engaged in the field of cluster validity, looking for
so-called objective criteria for decisions like the determination of the
number of clusters and for procedures that “should provide an automatic
decision rule to eliminate the problems of human subjectivity” (Milligan,
Cooper (1985), p. 162). These procedures are often statistical tests, which
raise new problems in the interpretation since “statistical analyses require
assumptions that are difficult to verify in individual situations” (Dubes,
Jain (1980), p. 179).

We criticize the above-mentioned conception of “objectivity” and propose
combining the cluster analysis solution with suitably selected information
obtained by subjective choice respecting the purpose of the investigation.



1.2 Problems in Metric Knowledge Representations 

It is well known that metrics are very useful in knowledge representations.
This experience encouraged many scientists to introduce metrics as a
hopefully useful tool even in situations where the meaning of the chosen
metric is not obvious, for example in arbitrary many-valued contexts. In many
data analysis methods (for example in cluster analysis, correspondence
analysis, multidimensional scaling, principal component analysis) a metric on
the set G of objects of a many-valued context (G, M, W, I) is introduced
using structures on the sets of values of the many-valued attributes.

There are many open questions. How to introduce a metric if there are
missing values? If there are no missing values, what is the meaning of a
metric on the Cartesian product of the sets of values of several many-valued
attributes? Which structure on the set of non-negative real numbers is
used in the process of interpretation: only the usual real ordering, or also
the algebraic or some other structure? Which granularity on the sets used
is chosen in the interpretation? Which knowledge about the given
many-valued context can be reconstructed from the derived distance matrix on
the set of objects? What is the pragmatic role of the distance matrix in the
final decision process?

In our opinion one of the central problems in the application of metrics lies
in the fact that metrics (like many other one-dimensional indices) are mainly
used to reduce the given information. For example, the distance between
two points in the plane does not contain any information about the direction
of the connecting line. Therefore the problem of interpreting reduced
information arises, and it is often not solvable.



2 Empirical Data 

In this article we use data from the “Allgemeine Bevölkerungsumfrage der
Sozialwissenschaften” (ALLBUS, General Population Survey of the Social
Sciences). The ALLBUS 1996 (ZA (1996)) is a survey financed by the Federal
Republic of Germany and its federal states via GESIS (Gesellschaft
sozialwissenschaftlicher Infrastruktureinrichtungen) and realized by
ZUMA (Zentrum für Umfragen, Methoden und Analysen, Mannheim) and
ZA (Zentralarchiv für empirische Sozialforschung, Köln) in co-operation with
the ALLBUS-Ausschuß.

For our examples we choose data concerning views on ethnic groups in
Germany and immigration, particularly the rating of the importance of
requirements for naturalization. Due to limitations in implemented
hierarchical cluster analysis procedures we use a reduced sample of 8 (of
436) variables and 208 (of 3518) interviewees (inhabitants of the federal
state of Hesse). The question was: “How important, in your opinion, are the
following criteria for giving German national status to a person?” [Scale
from 1 (not important at all) to 7 (very important)]

• person was born in Germany (variable v89) 

• person is of German extraction (v90) 

• person has command of German language (v91) 

• person has been living in Germany for a long time (v92) 

• person is willing to accommodate to German lifestyle (v93)

• person is member of a Christian Church (v94)

• person committed a punishable act (v95)

• person can earn her/his living (v96). 

The example from Figure 1 has been computed using the complete-linkage
method from STATISTICA 5.1 for Windows combined with squared Euclidean
distance, applied to the above-mentioned data after omitting rows containing
missing values.



3 Understanding Clusters Using Formal Concept Analysis

In this section the central idea of a joint conceptual representation of both
the given cluster analysis solution (dendrogram or just a partition) and a
“suitable view” into the given many-valued context is explained. To this end
we need two basic tools, namely conceptual scaling theory and nested line
diagrams. For a short introduction the reader is referred to Wolff (1994),
for the mathematical background to Ganter, Wille (1996).

3.1 Conceptual Description of Partitions and Cluster Hierarchies

Using just the basic definitions of a formal context and its concept lattice 
it is easy to see that any cluster hierarchy (and therefore any partition) 
can be represented without loss of information by the concept lattice of the 
formal context (G, M, ∈) where G is the set of objects of the hierarchy and
M the set of all clusters of the hierarchy. As an example we take a partition 
with 6 clusters. The concept lattice of the corresponding formal context is 
represented in Figure 1. 




Figure 1: Nominal scale of a partition 



Reading example: The cluster named cl3 contains exactly 30 objects. The 
top element represents the concept of all objects, the bottom element repre- 
sents the concept of all those elements which have all attributes (belong to 
each cluster), hence it is the concept with the empty set as extent. 
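
The construction can be checked on a small case. The sketch below is an
illustration with hypothetical names and toy data, not the paper's
6-cluster example: each cluster is treated as an attribute whose extent is
its set of members, and all concept extents are obtained by closing the
attribute extents, together with the full object set, under intersection.

```python
def concept_extents(objects, attribute_extents):
    """All concept extents of a formal context, given each attribute's extent."""
    found = {frozenset(objects)}            # top concept: all objects
    for extent in attribute_extents.values():
        # add the intersections of this attribute extent with every extent so far
        found |= {frozenset(extent) & other for other in found}
    return found


# A partition of six objects into three clusters (toy data)
objects = range(6)
partition = {"cl1": {0, 1}, "cl2": {2, 3, 4}, "cl3": {5}}

extents = concept_extents(objects, partition)
for extent in sorted(extents, key=len):
    print(sorted(extent))
```

For a partition the result is the full object set, one extent per cluster,
and the empty set as the bottom element, which matches the shape of the
nominal scale described above.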

In the following example we construct a rough view into the given data. For
each of the two variables v95 and v96 we are interested in the question how
many interviewees rated the importance of these variables high (value ≥ 5)
or low (value ≤ 4). The concept lattice of the corresponding formal context
is represented in Figure 2.

Figure 2: Scale for variables v95 and v96



There are exactly 168 interviewees rating both variables high, there are 
exactly 19 interviewees rating v95 high, but v96 low, and there are exactly 
16 interviewees rating both variables low. 
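
Such counts come from a simple two-step computation, sketched here on
invented ratings rather than the ALLBUS data. The sketch assumes that
“high” means a rating of 5 or more and “low” a rating of 4 or less on the
7-point scale; the function names and the threshold parameter are
hypothetical.

```python
from collections import Counter


def scale(rating, high_from=5):
    """Dichotomize a 1..7 rating into 'high' or 'low'."""
    return "high" if rating >= high_from else "low"


def contingency(rows, first, second):
    """Count interviewees for each (first, second) combination of high/low."""
    return Counter((scale(row[first]), scale(row[second])) for row in rows)


# Invented ratings for four interviewees
rows = [
    {"v95": 7, "v96": 6},
    {"v95": 6, "v96": 2},
    {"v95": 5, "v96": 5},
    {"v95": 1, "v96": 3},
]

table = contingency(rows, "v95", "v96")
print(table)
```

A different cut point for a single variable can be expressed by passing
another value for `high_from` when scaling that variable.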

3.2 Interpretation of Clusters in Concept Lattices

The concept lattice in Figure 2 is by construction just a direct product of
two chains of length 2 (representing a 2 × 2 contingency table). Since we are
interested in understanding clusters in terms of the values of all 8 variables
we also represent the other 6 variables in three pairs, as in Figure 2. For
the variable v94 we decided to scale “high” as “≥ 4” instead of “≥ 5”
since most of the interviewees answered with values ≤ 3 (low).

Figure 3 shows a visualization which is obtained from Figure 2 by nesting in 
each of the given points another diagram isomorphic to Figure 2 representing 
the next two variables in the rough scaling explained above. This nesting 
procedure is done three times yielding a nested line diagram of depth 4 
with 8 attributes. Using the program TOSCANA (Vogt, Wille (1995)) we 
calculated how many objects (interviewees) occupy each of the 256 cells (the 
possible object concepts). These frequencies are shown in Figure 3. 

But we also try to understand the conceptual meaning of a cluster (for ex- 
ample cluster cl3) by studying its distribution in this landscape of 256 cells. 
Therefore some cells are labeled by two numbers, the last is the frequency 
of this cell, the first is the number of interviewees of this cell which are also 
members of the cluster cl3. 
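
The two numbers attached to a cell can be computed as follows. This is a
minimal sketch on invented data with a hypothetical function name: each
interviewee has a profile of high/low values (two variables here instead of
eight), and every occupied cell is labeled with the number of cluster
members in the cell followed by the cell frequency.

```python
from collections import Counter


def cell_labels(profiles, cluster):
    """Map each occupied cell to (cluster members in the cell, cell frequency)."""
    total = Counter(profiles.values())
    inside = Counter(cell for person, cell in profiles.items()
                     if person in cluster)
    # Counter returns 0 for cells that contain no cluster member
    return {cell: (inside[cell], total[cell]) for cell in total}


# Invented scaled profiles for four interviewees
profiles = {
    "p1": ("high", "low"),
    "p2": ("high", "low"),
    "p3": ("high", "high"),
    "p4": ("low", "low"),
}
cluster = {"p1", "p3"}

labels = cell_labels(profiles, cluster)
print(labels)
```

A cell whose two numbers are equal lies entirely inside the cluster, while
a cell with a smaller first number is a contingency class split between
clusters.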

The black cells in Figure 3 represent object concepts, the gray cells represent 
the other concepts of the embedded concept lattice. Each gray cell represents 
a supremum of object concepts, hence the corresponding extents are also very 
meaningful subsets. The white cells represent the concepts of the chosen 
scale, in this example the possible object concepts. 

The 19 interviewees who rated v95 high and v96 low can be seen (as
in Figure 2) in the left quarter of Figure 3. This quarter is divided again
into 4 quarters, where the left one represents those 7 of the 19 interviewees
who rated “in Germany for a long time” high and “born in Germany”
low. Among these 7 interviewees, 5 rated “member of a Christian Church”
low, “command of German language” high and “of German extraction” low.
Four of these 5 interviewees rated “accommodation to German lifestyle” low,
and these 4 are indistinguishable in this rough scaling: they form a
contingency class and are, as one would expect, contained in one cluster,
namely cluster cl3. But some other contingency classes are split across
different clusters.

How can Figure 3 be used to understand the conceptual meaning of cluster
cl3? First of all we see that cluster cl3 is contained in the set of all
interviewees who rated “committed punishable act” high and, with one
exception, “born in Germany” low. With another exception, all interviewees
in cluster cl3 rated “of German extraction” low.

Finally we mention two other interesting facts which can be seen from Figure
3. With only one exception, all the interviewees who rated “committed
punishable act” low also rated “member of Christian Church” low. The
second fact is that there is a contingency class with 45 elements which can
be described very simply, since it is just the set of all interviewees with a
low rating for the variable “member of Christian Church” and a high rating
for all other variables. But this meaningful subset is not a cluster in the
partition above. Clearly there are many other meaningful subsets which can
be described as unions of contingency classes.

Figure 3: Nested line diagram of depth 4 with 8 attributes

The distribution of a cluster over the contingency classes of a nested line 
diagram can be visualized by the program TOSCANA by choosing the 
nominal scale of the given partition and zooming into the actually interesting 
cluster, hence representing only the elements of the cluster in the nested line 
diagram of the direct product of the scales of the variables which are chosen 
to explain the conceptual meaning of the given cluster. 

There are many other possibilities to combine the nominal scale of the given
partition with meaningful scales of variables. Especially useful is the
method of representing an interesting extent or contingent in the nominal
scale of the given partition by choosing the nominal scale as the last scale
in the priority list of scales.

Both methods, nominal scale with first or with last priority, are useful for
checking whether the given cluster has a presumed conceptual meaning, but
we do not see an automatic way to find a “good” conceptual meaning of a
cluster.

4 Conclusion 

Cluster analysis is based on two simple ideas which are often useful, but
too narrow for data analysis in general, namely the idea of a metric and
the idea of a hierarchical tree. A metric is a typical information-reducing
one-dimensional index, and a hierarchical tree is just a nested partition,
which heavily restricts the language about meaningful subsets in the data.
If the data are given as an object-attribute table (formally a many-valued
context) then the usual constructions of metrics on the set of objects result
in distance matrices from which the original many-valued context cannot be
reconstructed. We believe that this is the main reason for the difficulties
in the interpretation of cluster analysis solutions.

In contrast to this situation a concept lattice derived by conceptual scaling 
from a many-valued context represents not only the extents of all those
attributes which have been selected as meaningful with respect to the purpose
of the investigation, but also the whole closure system of the extents (and 
the closure system of the intents). The closure system of the extents, the 
partition of the contingents and the extent partitions can be used as building 
blocks to construct many partitions whose meaning can be easily described 
in the selected language of the conceptual scaling. Therefore we believe 
that the conceptual construction of meaningful clusters with respect to the 
purpose of the investigation is much more effective in practice than cluster 
analysis solutions especially if the used metrics are not directly related to 
the data. 
Abstract: Data contexts, which describe the relation between objects, attributes,
and attribute values, can often be described advantageously by algebraic struc- 
tures. In this contribution, a representation of data contexts by finite abelian 
groups will be discussed. For this representation, a framework is given by con- 
texts which have the elements of a group as objects (which label the rows of the 
data table or context), the elements of its character group as attributes (which 
label the columns), and the elements of the complex unit circle as attribute values 
(these label the entries in the cells of the data table). The non-empty extents of 
the appropriately scaled context are exactly the subgroups and their cosets. For 
the analysis of data, it is important to examine which data contexts are isomor- 
phic to or can be embedded into such group contexts. This will be explained by 
some examples, in particular taken from the field of experimental designs. 



1 Algebraic Representation of Data Contexts 

While analyzing data, the ability to represent them effectively is always a de- 
sirable goal. An effective representation of data supports the interpretation 
process and even the process of theory development. 

Most data we deal with are given in the form of a data table or data context.
The formalization of such a data context is the many-valued context
$\mathbb{K} := (G, M, W, I)$, which consists of the three sets $G$, $M$, and $W$, and the
ternary relation $I \subseteq G \times M \times W$, where $(g, m, w) \in I$ and $(g, m, v) \in I$
implies that $w = v$. The elements of $G$ are called objects, the elements of $M$
attributes, and the elements of $W$ are called attribute values. If $(g, m, w) \in I$
holds, then we say that the object $g$ has the value $w$ for the attribute $m$.
Many-valued contexts which describe data sets can be represented by algebraic
structures like vector spaces: A many-valued context $\mathbb{K} := (V, V^*, K, E)$
is called a bilinear context if $V$ is a (finite dimensional) vector space over
the (skew) field $K$, $V^*$ is the dual vector space belonging to $V$, and $E$ is
given by $(x, \varphi, k) \in E :\Leftrightarrow \varphi(x) = k$. In other words: This context can
be understood as an affine space, where the objects are the points, each 
attribute corresponds to a parallel bundle of affine hyperplanes, and two 
points have the same value with respect to an attribute if they lie on the 
same affine hyperplane of the corresponding parallel bundle. Thus, bilinear 
contexts give rise to a geometric representation of data: The objects are 
represented by points, the attributes are represented by sets of parallel hy- 
perplanes, and two objects have the same value for some attribute if they lie 
on the same hyperplane of the parallel bundle of this attribute. A complete
characterization of bilinear contexts can be found in [WILLE (1991)] and
[WILLE (1996)]. However, in most cases many-valued contexts are
neither bilinear nor isomorphic to a complete bilinear context. But sometimes
a many-valued context can be embedded into some bilinear context,
which means that it is isomorphic to a smaller part of the bilinear context,
i.e. to the context that is induced by a subset of objects and a subset of
attributes of the bilinear context. Formally, this is defined as follows. An
embedding from a many-valued context to another consists of three mappings:
Let $\mathbb{K} := (G, M, W, I)$ and $\mathbb{K}' := (G', M', W', I')$ be two many-valued
contexts; further, let $\alpha : G \to G'$, $\beta : M \to M'$, and $\gamma : W \to W'$ be
injective mappings. Then $(\alpha, \beta, \gamma)$ is an embedding from $\mathbb{K}$ into $\mathbb{K}'$ if, for
$g \in G$, $m \in M$, $w \in W$,

$(g, m, w) \in I \Leftrightarrow (\alpha(g), \beta(m), \gamma(w)) \in I'.$

A many-valued context that is embeddable into a bilinear context has a
geometric representation, which is a part of the geometric representation of the
corresponding bilinear context: The objects are represented by some points, 
and each attribute is represented by a set of some parallel hyperplanes. 
Two objects have the same value with respect to an attribute if their cor- 
responding points are found on the same hyperplane of the set of parallels 
that corresponds to the attribute. Apart from a simple method of repre- 
sentation, an embedding of a data context into a bilinear context also offers 
information about the interdependencies between the attributes. While the 
linear representation is closely connected with the vector space structure, 
there might be other algebraic structures that give rise to bialgebraic con- 
texts. Thus, they help us understand and formalize interdependencies, even 
in those data sets that cannot be embedded into a bilinear context. Let us 
regard the following example: 

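Example: Data Context of German presidents

| president  | age of entrance | terms of office | party |
|------------|-----------------|-----------------|-------|
| Heuss      | > 60            | 2               | FDP   |
| Lübke      | > 60            | 2               | CDU   |
| Heinemann  | > 60            | 1               | SPD   |
| Scheel     | < 60            | 1               | FDP   |
| Carstens   | < 60            | 1               | CDU   |
| Weizsäcker | > 60            | 2               | CDU   |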

Suppose that the data context of the example has an embedding into a 
bilinear context. Since the attribute “terms” is functionally dependent on 
the attributes “age” and “party”, the vector space of the bilinear context 
may be assumed to be 2-dimensional. In a vector space, two vectors generate 
exactly one affine line. If “Heuss” and “Lübke” have the same value “> 60”
with respect to “age”, then the vectors on which they are mapped lie on
the affine line characterized by the linear form on which “age” is mapped
and by the value “> 60”. Since “Heinemann” also has the same value with
respect to “age”, its vector belongs to the same affine line. Now “Heuss”
and “Lübke” also have the same value with respect to “terms”, from which
we can conclude that the linear forms on which “age” and “terms” are
mapped characterize the same parallel bundle of affine lines. Therefore,
“Heinemann” ought to have the same value as “Heuss” and “Lübke” with
respect to “terms”, which is not the case. This contradiction leads to the
conclusion that we have to refrain from a linear representation of these data.
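
The functional dependency used in this argument can be checked mechanically. The following Python sketch is illustrative only: the row encoding and the helper name `is_functionally_dependent` are assumptions made for this example and do not appear in the original text.

```python
# Rows of the example context: (president, age of entrance, terms of office, party).
PRESIDENTS = [
    ("Heuss", ">60", 2, "FDP"),
    ("Lübke", ">60", 2, "CDU"),
    ("Heinemann", ">60", 1, "SPD"),
    ("Scheel", "<60", 1, "FDP"),
    ("Carstens", "<60", 1, "CDU"),
    ("Weizsäcker", ">60", 2, "CDU"),
]
AGE, TERMS, PARTY = 1, 2, 3  # column positions within a row


def is_functionally_dependent(rows, determinants, dependent):
    """True if equal values on `determinants` always force equal values on `dependent`."""
    seen = {}
    for row in rows:
        key = tuple(row[i] for i in determinants)
        # setdefault stores the first value seen for this key and returns it afterwards
        if seen.setdefault(key, row[dependent]) != row[dependent]:
            return False
    return True
```

With this encoding, "terms" depends on the pair ("age", "party") but on neither attribute alone, which is the situation the argument above relies on.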



2 Many-valued Contexts Described by 
Abelian Groups 

In the past, bialgebraic contexts of abelian groups have proved to be
appropriate for the description of data (see [VOGT, WILLE (1994)]). The
concept of a bialgebraic context of an abelian group can be generalized to
the many-valued case: A many-valued context $\mathbb{K}_G := (G, G^*, S, E)$ is called
a group context if $G$ is a finite abelian group, $G^*$ is its dual group, i.e.
the set of all group homomorphisms from $G$ into the complex unit circle,
$S$ is the complex unit circle, and $E$ is given by $(g, \varphi, c) \in E :\Leftrightarrow \varphi(g) = c$.
We will now demonstrate that the example can be embedded into a group
context for some appropriate finite abelian group.

The example context can be embedded into the group context
$(\mathbb{Z}_6 \times \mathbb{Z}_6, (\mathbb{Z}_6 \times \mathbb{Z}_6)^*, S, E)$ by the following embedding $(\alpha, \beta, \gamma)$, where $\alpha$ maps the set of the
presidents to $\mathbb{Z}_6 \times \mathbb{Z}_6$, $\beta$ maps the attributes “age”, “terms”, and “party”
to $(\mathbb{Z}_6 \times \mathbb{Z}_6)^*$, and $\gamma$ maps the attribute values to the complex unit circle.



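| $\alpha$ | $\beta$ | $\gamma$ |
|---|---|---|
| $\alpha(\text{Heuss}) := (2,3)$ | $\beta(\text{age}) := x$ | $\gamma(< 60) := e^{\frac{2\pi i}{6} \cdot 0}\ (\to 0)$ |
| $\alpha(\text{Lübke}) := (2,0)$ | $\beta(\text{terms}) := x + 2y$ | $\gamma(> 60) := e^{\frac{2\pi i}{6} \cdot 2}\ (\to 2)$ |
| $\alpha(\text{Heinemann}) := (2,2)$ | $\beta(\text{party}) := y$ | $\gamma(1) := e^{\frac{2\pi i}{6} \cdot 0}\ (\to 0)$ |
| $\alpha(\text{Scheel}) := (0,3)$ | | $\gamma(2) := e^{\frac{2\pi i}{6} \cdot 2}\ (\to 2)$ |
| $\alpha(\text{Carstens}) := (0,0)$ | | $\gamma(\text{CDU}) := e^{\frac{2\pi i}{6} \cdot 0}\ (\to 0)$ |
| $\alpha(\text{Weizsäcker}) := (2,0)$ | | $\gamma(\text{SPD}) := e^{\frac{2\pi i}{6} \cdot 2}\ (\to 2)$ |
| | | $\gamma(\text{FDP}) := e^{\frac{2\pi i}{6} \cdot 3}\ (\to 3)$ |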



For verification, the relevant parts of $(\mathbb{Z}_6 \times \mathbb{Z}_6, (\mathbb{Z}_6 \times \mathbb{Z}_6)^*, S, E)$ are given in
the following table. As values in this context, we have taken the elements
of $\mathbb{Z}_6$ instead of the elements of the unit circle. This is already suggested in
brackets in the description of the mapping $\gamma$, which give the correspondence
between a value of the unit circle and the appropriate element of $\mathbb{Z}_6$. This
is done in order to simplify the notation. It is admissible because $\mathbb{Z}_6$ is
isomorphic to the union of all homomorphic images of $\mathbb{Z}_6 \times \mathbb{Z}_6$ in the unit
circle. Consequently, this justifies writing the elements of
$(\mathbb{Z}_6 \times \mathbb{Z}_6)^*$ as mappings from $\mathbb{Z}_6 \times \mathbb{Z}_6$ into $\mathbb{Z}_6$.





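|       | x   | y   | x + 2y |
|-------|-----|-----|--------|
| (0,0) | 0   | 0   | 0      |
| (0,1) | 0   | 1   | 2      |
| (0,2) | 0   | 2   | 4      |
| (0,3) | 0   | 3   | 0      |
| ...   | ... | ... | ...    |
| (2,0) | 2   | 0   | 2      |
| (2,1) | 2   | 1   | 4      |
| (2,2) | 2   | 2   | 0      |
| (2,3) | 2   | 3   | 2      |
| ...   | ... | ... | ...    |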
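
The verification suggested by this table can also be carried out programmatically. The sketch below is only an illustration under stated assumptions: the dictionaries `ALPHA`, `BETA`, `GAMMA`, and `CONTEXT` are hypothetical encodings of the mappings and of the example context, the characters are written additively as maps into $\mathbb{Z}_6$, and `GAMMA` is keyed by attribute for convenience, which the original does not do.

```python
N = 6  # the group is Z_6 x Z_6, values are taken in Z_6 instead of the unit circle

# alpha: presidents -> Z_6 x Z_6, as listed in the text
ALPHA = {
    "Heuss": (2, 3), "Lübke": (2, 0), "Heinemann": (2, 2),
    "Scheel": (0, 3), "Carstens": (0, 0), "Weizsäcker": (2, 0),
}

# beta: attributes -> characters, written additively as maps into Z_6
BETA = {
    "age": lambda x, y: x % N,
    "terms": lambda x, y: (x + 2 * y) % N,
    "party": lambda x, y: y % N,
}

# gamma: attribute values -> Z_6 (the bracketed values in the text);
# keying gamma by attribute is an assumption of this sketch
GAMMA = {
    "age": {"<60": 0, ">60": 2},
    "terms": {1: 0, 2: 2},
    "party": {"CDU": 0, "SPD": 2, "FDP": 3},
}

CONTEXT = {
    "Heuss": {"age": ">60", "terms": 2, "party": "FDP"},
    "Lübke": {"age": ">60", "terms": 2, "party": "CDU"},
    "Heinemann": {"age": ">60", "terms": 1, "party": "SPD"},
    "Scheel": {"age": "<60", "terms": 1, "party": "FDP"},
    "Carstens": {"age": "<60", "terms": 1, "party": "CDU"},
    "Weizsäcker": {"age": ">60", "terms": 2, "party": "CDU"},
}


def mismatches():
    """All (president, attribute) pairs where beta(m)(alpha(g)) differs from gamma(w)."""
    return [(g, m) for g, row in CONTEXT.items() for m, w in row.items()
            if BETA[m](*ALPHA[g]) != GAMMA[m][w]]
```

An empty result from `mismatches()` means that every entry of the example context is reproduced by the group context. The sketch only checks this direction of the embedding condition.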















In view of the above example it seems justified to have a closer look at
the conditions which an arbitrary many-valued context has to fulfill to
be embeddable into a group context. As a first step towards an embedding
theorem we need an exact knowledge of the structure of group contexts.
Formal concept analysis provides methods to analyze and visualize
structures of many-valued contexts via formal contexts. A formal context
is a triple $\mathbb{K} := (G, M, I)$ where $G$ and $M$ are sets and $I \subseteq G \times M$
is a binary relation. It is often represented as a cross table, where the
names of the objects label the rows and the names of the attributes label
the columns, and a cross is put in the cell of row $g$ and column $m$ if
$(g, m) \in I$. A formal concept of a formal context $\mathbb{K} := (G, M, I)$ is a pair
$(A, B)$ with $A \subseteq G$, $B \subseteq M$, $A = \{g \in G \mid (g, m) \in I \text{ for all } m \in B\}$, and
$B = \{m \in M \mid (g, m) \in I \text{ for all } g \in A\}$. $A$ and $B$ are called the extent
and the intent of the formal concept $(A, B)$, respectively. Together with
the partial order defined by $(A, B) \le (C, D) :\Leftrightarrow A \subseteq C$, all the concepts of
a formal context form a lattice. This lattice is called the concept lattice
of the formal context. In the order diagram of the concept lattice, the concepts
which are generated by a single object or attribute are labelled with
the name of this object or attribute. Now the concept lattice provides the
same information as its formal context: An object $g$ has an attribute $m$
($(g, m) \in I$) if there is an ascending path from the circle representing the
concept of $g$ to the circle representing the concept of $m$ in the order diagram.
In order to represent a many-valued context by a concept lattice, we have to
transform it into a formal context. This transformation is called conceptual
scaling. Let us regard the group context of $\mathbb{Z}_2 \times \mathbb{Z}_2$.








|       | 0 | x  | y  | x + y |
|-------|---|----|----|-------|
| (0,0) | 1 | 1  | 1  | 1     |
| (1,0) | 1 | -1 | 1  | -1    |
| (0,1) | 1 | 1  | -1 | -1    |
| (1,1) | 1 | -1 | -1 | 1     |

As we can see, there are only two values, 1 and -1. The simplest scaling
method is to replace either 1 or -1 by a cross and to leave the remaining
cells blank: For each value $c \in S$ and each group context $\mathbb{K}_G$, we can define a
scaled context $\mathbb{K}_G^c := (G, G^*, I^c)$ where $I^c$ is given by $(g, \varphi) \in I^c :\Leftrightarrow \varphi(g) = c$.

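This yields the following two formal contexts:

| $\mathbb{K}^{1}_{\mathbb{Z}_2 \times \mathbb{Z}_2}$ | 0 | x | y | x + y |
|-------|---|---|---|-------|
| (0,0) | × | × | × | ×     |
| (1,0) | × |   | × |       |
| (0,1) | × | × |   |       |
| (1,1) | × |   |   | ×     |

with the set of extents $\mathfrak{U}(\mathbb{K}^{1}_{\mathbb{Z}_2 \times \mathbb{Z}_2}) = \{\{(0,0)\}, \{(0,0),(1,0)\}, \{(0,0),(0,1)\}, \{(0,0),(1,1)\}, \{(0,0),(1,0),(0,1),(1,1)\}\}$ and

| $\mathbb{K}^{-1}_{\mathbb{Z}_2 \times \mathbb{Z}_2}$ | 0 | x | y | x + y |
|-------|---|---|---|-------|
| (0,0) |   |   |   |       |
| (1,0) |   | × |   | ×     |
| (0,1) |   |   | × | ×     |
| (1,1) |   | × | × |       |

with the set of extents $\mathfrak{U}(\mathbb{K}^{-1}_{\mathbb{Z}_2 \times \mathbb{Z}_2}) = \{\emptyset, \{(1,1)\}, \{(0,1)\}, \{(1,0)\}, \{(1,1),(0,1)\}, \{(1,1),(1,0)\}, \{(0,1),(1,0)\}, \{(0,0),(0,1),(1,0),(1,1)\}\}$.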

These two contexts give rise to the concept lattices of Figure 1 and Figure 2.

Figure 1: Concept lattice of $\mathbb{K}^{1}_{\mathbb{Z}_2 \times \mathbb{Z}_2}$

Figure 2: Concept lattice of $\mathbb{K}^{-1}_{\mathbb{Z}_2 \times \mathbb{Z}_2}$



As we can see, the extents of $\mathbb{K}^{1}_{\mathbb{Z}_2 \times \mathbb{Z}_2}$ are exactly the subgroups of $\mathbb{Z}_2 \times \mathbb{Z}_2$,
which, in general, is shown in [VOGT (1995)]. The extents of $\mathbb{K}^{-1}_{\mathbb{Z}_2 \times \mathbb{Z}_2}$ are
exactly the cosets of subgroups of $\mathbb{Z}_2 \times \mathbb{Z}_2$ not containing the neutral element which have
order 2 in the corresponding factor group (in this special case this means all
cosets), together with the empty set and the whole group; here $-1$ is a primitive
root of unity of order 2 ($c$ is a primitive $n$-th root of unity if $c^n = 1$ and $c^k \ne 1$
for all $0 < k < n$). Generally, the following theorem can be proven:

Theorem: Let $G$ be a finite abelian group, let $\mathbb{K}_G$ be the corresponding
group context, and let $c \in S$ be a primitive $n$-th root of unity. Then $\mathfrak{U}(\mathbb{K}_G^c)$
is the set of all cosets of subgroups of $G$ that have order $n$ in the corresponding
factor group, together with the empty set and the whole group.

The proof is omitted here for reasons of space, but can be found in
[GROSSKOPF].
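
For the small group $\mathbb{Z}_2 \times \mathbb{Z}_2$ the extents of the two scaled contexts can be enumerated directly. The following Python sketch is an illustration, not the omitted proof: the names `character_value` and `extents` are hypothetical, and the extents are computed as all intersections of attribute extents together with the whole object set, which is the standard description of extents in formal concept analysis.

```python
from itertools import combinations, product

G = frozenset(product(range(2), repeat=2))      # elements (x, y) of Z_2 x Z_2
CHARACTERS = list(product(range(2), repeat=2))  # (a, b) stands for g -> (-1)^(a*x + b*y)


def character_value(char, g):
    """Value of the character (a, b) at the group element g = (x, y); either 1 or -1."""
    a, b = char
    x, y = g
    return (-1) ** ((a * x + b * y) % 2)


def extents(c):
    """Extents of the context scaled at value c: intersections of attribute extents, plus G."""
    attribute_extents = [
        frozenset(g for g in G if character_value(char, g) == c)
        for char in CHARACTERS
    ]
    result = {G}  # the empty intersection is the whole object set
    for r in range(1, len(attribute_extents) + 1):
        for family in combinations(attribute_extents, r):
            result.add(frozenset.intersection(*family))
    return result
```

For `c = 1` this yields the five subgroups, and for `c = -1` it yields the eight sets listed above, in agreement with the two sets of extents given for the example.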



3 Experimental Designs 

An experimental design describes the set-up for an experiment. In an ex- 
periment usually some objects are tested with respect to several factors or 
treatments. A factor, like the subject’s age or the intensity of stimulation, 
consists of a set of modalities, like the different ages of the subjects or the
possible intensities of stimulation. The experimental design specifies which 
combination of modalities of the factors is tested with which object. If 
the design consists of all (theoretically) possible combinations of modalities, 
then the design can be understood as the cartesian product of the different
factors. In this case, it is called a complete design. But in most
cases a design is not complete because it is too costly or because some
factors cannot be crossed, e.g. for technical reasons. Then the design can 
be understood as a subset of the cartesian product of the factors. 

A good method of representing an experimental design is to represent it in 
the form of a many-valued context. The objects that are to be tested are
taken as the set of objects of the context; they label the rows. The factors
can be considered as attributes; each factor labels a column. As attribute
values we have the elements of the factors, i.e. the modalities. Then we enter
into the cell of a certain object and a certain factor the modality under which 
this object is tested with respect to this factor. 

The example context of German presidents can also be understood as an 
experimental design, but it is not complete, since not all possible combina- 
tions of age, number of terms, and party are realized. The representation of 
a complete experimental design does not raise any problem: Any complete 
design can always be embedded into a bilinear context of sufficient size, for 
it does not have any functional dependencies of attributes. Figure 3 shows 
the concept lattice of the complete version of the example context. 




Figure 3: Concept lattice of a complete experimental design 



Here all combinations of all modalities of the three factors party, terms and
age are realized. Each of the object concepts labelled with O0, O1, ..., O11
stands for one of these combinations.
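
The difference between the complete design and the design actually realized by the example context can be made explicit with a cartesian product. This is a minimal sketch; the names `FACTORS`, `REALIZED`, `COMPLETE`, and `MISSING` are chosen for illustration only.

```python
from itertools import product

# Modalities of the three factors of the example context
FACTORS = {
    "age": ["<60", ">60"],
    "terms": [1, 2],
    "party": ["CDU", "FDP", "SPD"],
}

# Combinations (age, terms, party) that occur in the example context
REALIZED = {
    (">60", 2, "FDP"),  # Heuss
    (">60", 2, "CDU"),  # Lübke, Weizsäcker
    (">60", 1, "SPD"),  # Heinemann
    ("<60", 1, "FDP"),  # Scheel
    ("<60", 1, "CDU"),  # Carstens
}

# The complete design is the cartesian product of the factors
COMPLETE = set(product(*FACTORS.values()))
MISSING = COMPLETE - REALIZED
```

The complete design has 2 · 2 · 3 = 12 combinations, of which the example context realizes only five distinct ones.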

The representation of designs that are not complete is more interesting; in 
[VOGT, WILLE (1994)] the design for an experiment concerning the access 
to lexical information is examined. The lattice of the equivalence relations 
that the factors give rise to cannot be embedded into the subgroup lattice 
of Zg × Z2 × Z4. Although the structure of the lattice suggests a bialgebraic
embedding of the context of the design into the bialgebraic context
(Zg × Z2 × Z4, Zg × Z2 × Z4, ⊥), this is not yet possible. Only by adding
auxiliary equivalence relations is the embedding made possible. This shows
that it is sometimes necessary to introduce additional attributes to allow an 
embedding. 

The structures of non-complete experimental designs can also be described
by means of group theory: Let $A$, $B$ be two factors and $\mathcal{A}$, $\mathcal{B}$ the two
partitions defined by $A$ and $B$ (respectively). $A$ is nested into $B$ if $\mathcal{A}$ is finer
than $\mathcal{B}$, i.e. if for every $a \in \mathcal{A}$ and every $b \in \mathcal{B}$ it holds that either $a \subseteq b$ or
$a \cap b = \emptyset$. $A$ and $B$ are crossed if each $\mathcal{A}$-class intersects each $\mathcal{B}$-class.
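
Both relations between factors translate directly into set operations on the partitions. The sketch below follows the two definitions literally; the function names and the small 2 × 2 example are assumptions made for illustration.

```python
def is_nested(finer, coarser):
    """Every class of `finer` is either contained in or disjoint from every class of `coarser`."""
    return all(a <= b or not (a & b) for a in finer for b in coarser)


def is_crossed(partition_a, partition_b):
    """Each class of `partition_a` intersects each class of `partition_b`."""
    return all(a & b for a in partition_a for b in partition_b)


# Four objects 0..3 arranged as a 2 x 2 layout (hypothetical example)
ROWS = [{0, 1}, {2, 3}]
COLUMNS = [{0, 2}, {1, 3}]
CELLS = [{0}, {1}, {2}, {3}]
```

In this example the row and column factors are crossed, while the partition into single cells is nested into the rows.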

In a group context, interpreted as an experimental design, a factor $A$ is
nested into $B$ if, for the corresponding attributes or group homomorphisms
$\varphi_A$ and $\varphi_B$, it holds that $\{g \in G \mid \varphi_A(g) = 1\} \subseteq \{g \in G \mid \varphi_B(g) = 1\}$.

Two factors $A$ and $B$ are crossed if $G$ is isomorphic to the direct sum of
$\{g \in G \mid \varphi_A(g) = 1\}$ and $\{g \in G \mid \varphi_B(g) = 1\}$.

Obviously, group contexts have an application not only when dealing with 
data tables and data analysis, but also in the field of experimental designs. 
Although these applications seem to be closely related, there are differences. 
In the case of experimental designs, the attribute values are fixed in advance 
and the measured values are in correspondence with the objects. In the case 
of data tables, objects and attributes are fixed and the measured values take 
the role of the attribute values. 
Abstract: Assume that n q-dimensional data points have been obtained and
subjected to a cluster analysis algorithm. A potential concern is whether the
resulting clusters have a “causal” interpretation or whether they are merely
consequences of “random” fluctuation. In previous reports, the asymptotic properties
of a number of potentially useful combinatorial tests based on the theory of ran- 
dom interval graphs were described. In the present work, comparisons of the 
asymptotic efficacy of a class of these tests are provided. As a particular illustra- 
tion of potential applications, we discuss the detection of mixtures of probability 
distributions and provide some numerical illustrations. 



1 Introduction and Summary 

Let $F_X(x)$ be a cumulative distribution function on $E_q$, $q$-dimensional
Euclidean space, $q \ge 1$. We assume that $F_X(x)$ is absolutely continuous
with respect to $q$-dimensional Lebesgue measure and denote the corresponding
probability density function by $f_X(x)$. Assume that a random sample
of size $n$ has been obtained from $F_X(x)$ and denote the realizations
by $x_1, x_2, \ldots, x_n$. In cluster analysis, similar objects are to be placed in

the same cluster. We will interpret similarity as being close with respect 
to some distance on Eq. The relationship between graph theory and clus- 
ter analysis has been described in the books by Bock (1974) and Gode- 
hardt (1990). Mathematical results related to those used here are given 
in Eberl and Hafner (1971), Hafner (1972), Godehardt and Harris (1995), 
Godehardt and Harris (1998) and Maehara (1990). 

In order to proceed, we need to introduce some notions from graph theory. 



2 Graph Theoretic Concepts 

A graph $\mathcal{G}_n = (V, \mathcal{E})$ is defined as follows. $V$ is a set with $|V| = n$ and $\mathcal{E}$
is a set of (unordered) pairs of elements of $V$. The elements of $V$ are called
the vertices of the graph $\mathcal{G}_n$ and the pairs in $\mathcal{E}$ are referred to as the edges
of that graph. With no loss of generality, we can assume $V = \{1, 2, \ldots, n\}$.
For the purposes at hand, we choose a distance $\rho$ on $E_q$ and a threshold
$d > 0$. Then for $i \ne j$, place $(i, j)$ in $\mathcal{E}$ if and only if $\rho(x_i, x_j) \le d$. Since
$x_1, x_2, \ldots, x_n$ are realizations of random variables, the set $\mathcal{E}$ is a random
set and the graph $\mathcal{G}_n$ is a random graph. In particular, these graphs are
generalizations of interval graphs. Specifically, if $I_1, I_2, \ldots, I_n$ are intervals
on the real line and $\mathcal{I}_n = \{I_1, I_2, \ldots, I_n\}$, then the interval graph $\mathcal{G}(\mathcal{I}_n)$ is
defined by $V = \{1, 2, \ldots, n\}$ and $(i, j) \in \mathcal{E}$ if $I_i \cap I_j \ne \emptyset$, $1 \le i < j \le n$.

Thus, for the model under consideration, if $q = 1$, the intervals $I_i$ are the
intervals $[x_i - d/2, x_i + d/2]$, $i = 1, 2, \ldots, n$.

Let $V_m \subseteq V$ with $|V_m| = m \le n$. $\mathcal{K}_{m,d}$ is a complete subgraph of order $m$
if all pairs of elements of $V_m$ are in $\mathcal{E}$. If $m = 1$, then $\mathcal{K}_{1,d}$ is a vertex;
if $m = 2$, then $\mathcal{K}_{2,d}$ is an edge; and if $m = 3$, then $\mathcal{K}_{3,d}$ is called a triangle.
A vertex has degree $\nu$, $\nu = 0, 1, 2, \ldots, n - 1$, if there are exactly $\nu$ edges
incident with that vertex. If $\nu = 0$, then that vertex is said to be an isolated
vertex.
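
The graph-theoretic notions of this section can be illustrated for one-dimensional data. The following Python sketch is a brute-force illustration under stated assumptions: the function names are hypothetical, vertices are numbered from 0 instead of 1, the distance is the absolute difference on the real line, and the sample points are made up.

```python
from itertools import combinations


def threshold_graph(points, d):
    """Edges (i, j) with i < j between points whose distance is at most d."""
    return {(i, j) for i, j in combinations(range(len(points)), 2)
            if abs(points[i] - points[j]) <= d}


def degrees(n, edges):
    """Degree of every vertex 0..n-1; a degree of 0 marks an isolated vertex."""
    deg = [0] * n
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    return deg


def count_complete_subgraphs(n, edges, m):
    """Number of vertex sets of size m whose pairs are all edges (brute force)."""
    return sum(all(pair in edges for pair in combinations(vertices, 2))
               for vertices in combinations(range(n), m))


POINTS = [0.0, 0.05, 0.08, 0.5]      # made-up realizations x_1, ..., x_4
EDGES = threshold_graph(POINTS, d=0.1)
```

Here the first three points form a triangle and the last point is an isolated vertex. The brute-force count is exponential in `m` and is meant only for small illustrations.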



3 Probability Distributions for Characteristics
of Real Interval Graphs

We now describe the probability that a specified set of $m$ vertices form
a $\mathcal{K}_{m,d}$. With no loss of generality, we can assume that these vertices are
labeled $1, 2, \ldots, m$. Then,

$$\Pr\{\mathcal{K}_{m,d}\} = m \int_{-\infty}^{\infty} \{F(x + d) - F(x)\}^{m-1} f(x)\,dx,$$

and the probability that a specified vertex has degree $\nu$, $\nu = 0, 1, \ldots, n - 1$,
is

$$\Pr\{1 \text{ has degree } \nu\} = \binom{n-1}{\nu} \int_{-\infty}^{\infty} \{F(x + d) - F(x - d)\}^{\nu} \{1 - F(x + d) + F(x - d)\}^{n-1-\nu} f(x)\,dx,$$

as given in Harris and Godehardt (1998) (please substitute m for the wrong upper
bounds n in Formula (1) of the cited paper).

To obtain asymptotic approximations to the above distributions, some
assumptions concerning the behavior of the probability density function $f_X(x)$
are needed. Hence we will assume that the probability density function is
uniformly continuous on every compact subset of the carrier set for $X$ and
let $f'_X(x)$ exist and be uniformly bounded on the carrier set of $X$.
4 Asymptotic Behavior of Probability Distributions
of Properties of Random Interval Graphs

In this section, we examine the asymptotic behavior of the probability
distributions introduced in the preceding section, under the conditions $n \to \infty$
and $d \to 0$ (usually, $d$ is a given function of $n$, $d = d(n)$) and also assuming
the regularity conditions for $f_X(x)$ given above. For fixed $m$ and $d \to 0$, the
asymptotic probability that the vertices $1, 2, \ldots, m$ form a $\mathcal{K}_{m,d}$ is

$$\Pr\{\mathcal{K}_{m,d}\} \sim m d^{m-1} \int_{-\infty}^{\infty} f^{m}(x)\,dx$$

(see also Harris and Godehardt (1998)). Also, the asymptotic means and
variances of the number of complete subgraphs of order $m$ are given by

$$E\{|\mathcal{K}_{m,d}|\} \sim \binom{n}{m} \Pr\{\mathcal{K}_{m,d}\}$$

and

$$\operatorname{Var}\{|\mathcal{K}_{m,d}|\} \sim \ldots$$

5 Rates of Convergence for Asymptotic Normality

From the theory of U-statistics, the number of complete subgraphs of order
$m$ that will be observed in a graph on $n$ vertices has an asymptotically
normal distribution whenever $n^{m} d^{m-1} \to \infty$ and $n^{m+1} d^{m} \to c > 0$ as
$n \to \infty$ and $d \to 0$. Hence, suppose $m = 2$, that is, we are counting the
number of edges that are observed, and presuming that $n^2 d$ is “large” and
$n^3 d^2$ is “moderate”. If the above rates apply, then the expected number
of triangles is of the order $n^2 d\,(nd)$ and tends to a positive constant and
the variance of the number of triangles tends to zero. Similarly, for larger
values of $m$, under this limiting process, asymptotically degenerate random
variables will be obtained. This suggests very strongly that $m = 2$ provides
more information for a given value of $n$ than larger values of $m$.



6 Detection of Mixtures of Probability Distributions

A problem closely related to detection of clusters is the detection of mixtures
of probability distributions. A mixture of $k$ probability density functions is
a probability density function of the form

$$f_X(x) = \sum_{i=1}^{k} \alpha_i f_i(x),$$

where the $\alpha_i > 0$ satisfy $\sum_{i=1}^{k} \alpha_i = 1$. In order to avoid some mathematical
complications, we require that the probability distributions whose probability
density functions are $f_i(x)$ be mutually absolutely continuous and that
they are distinct (that is, there is some set $A(i,j)$ of positive Lebesgue measure
for which $\int_{A(i,j)} f_i(x)\,dx \ne \int_{A(i,j)} f_j(x)\,dx$ for $i \ne j$, $1 \le i < j \le k$). Let

$X_1, X_2, \ldots, X_n$ be a random sample from $f_X(x)$ and let $d > 0$ be a specified
threshold. The asymptotic approximations to be used here are valid under
the assumption that $d \to 0$ and $n \to \infty$ so that $n^{m} d^{m-1} \to \infty$. Then, for
example, the asymptotic probability that $m$ randomly selected realizations
result in a complete subgraph of order $m$ is

$$\Pr\{\mathcal{K}_{m,d}\} \sim m d^{m-1} \sum_{j_1 + \cdots + j_k = m} \binom{m}{j_1, \ldots, j_k} \prod_{i=1}^{k} \alpha_i^{j_i} \int_{-\infty}^{\infty} \prod_{i=1}^{k} f_i^{j_i}(x)\,dx,$$

where $j_i \ge 0$, $i = 1, \ldots, k$, and $\sum_{i=1}^{k} j_i = m$. (This formula corrects a printing
mistake in Harris and Godehardt (1998), where for the special case $k = 2$,
$2md^{m-1}$ has been written in Formulas (12)-(14) instead of the correct factor
$md^{m-1}$.)

We propose proceeding as follows. First, we postulate that there is no mixture,
that is, $k = 1$, $\alpha_1 = 1$. We estimate the expected value and variance
of $|\mathcal{K}_{m,d}|$ under this assumption and compare the observed number of complete
subgraphs of order $m$ with this mean. If it differs substantially (relative
to the variance) from the estimated expected value, then we conclude that
there is a non-trivial mixture. We can repeat this, postulating that there is
a mixture of two distributions and observe whether the data is compatible
with this assumption, and so forth.

Naturally, the tests in this sequence are not stochastically independent;
however, much practical statistical data analysis is carried out in essentially
this manner.



7 Mixtures of Exponential Distributions 

The following application of these methods is presumably of minor signifi- 
cance in cluster analysis, but arises naturally in reliability theory and risk 
analysis. Nevertheless, this can serve as a demonstration of the methodology. 
Here, we consider detecting mixtures of exponential distributions using tests 
based on the theory of complete subgraphs of order m. A natural way that 
might be proposed is to use the likelihood ratio test and the corresponding 
asymptotic theory. Unfortunately in the case of mixtures, the required reg- 
ularity conditions for the asymptotic theory of the likelihood ratio are not 
satisfied. Therefore, the present technique may be a suitable alternative. 
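Consequently let $f_i(x) = \lambda_i e^{-\lambda_i x}$, $\lambda_i > 0$, $x > 0$ for $i = 1, 2, \ldots, k$. Hence,
we get

$$\Pr\{\mathcal{K}_{m,d}\} \sim m d^{m-1} \sum_{j_1 + \cdots + j_k = m} \binom{m}{j_1, \ldots, j_k} \frac{\prod_{i=1}^{k} (\alpha_i \lambda_i)^{j_i}}{\sum_{i=1}^{k} j_i \lambda_i}.$$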



Two specific cases are considered. In order to simplify the computations both
illustrations utilize $k = 2$ and $m = 2$; that means we consider a mixture of
two exponential distributions and are counting the number of realized edges.

Example 1. Let $X$ be the random lifetime of some device. It has been
assumed that $X$ has the exponential distribution with known intensity $\lambda$.
Subsequently, it is suspected that a second production source with a different
intensity rate has been introduced. Therefore some of the data may be coming
from the first production source and some from the second production
source, and thus the observed data may be from a mixture of two exponential
distributions. With no loss of generality, we can assume $\lambda_1 = 1$. To provide
a numerical illustration, 300 observations were simulated, 183 from an
exponential distribution with intensity $\lambda_1 = 1$, and 117 from the exponential
distribution with $\lambda_2 = 2$, hence $\alpha_1 = \alpha = .61$ and $\alpha_2 = 1 - \alpha_1 = .39$. The
simulated lifetime data has a sample mean $\bar{x} = .830$ and a sample variance
$s^2 = .708$. The threshold $d$ was set at .01 resulting in an observed number
of edges of 563. In addition, the sample second moment of the lifetimes is
$m_2 = 1.397$ and the sample third moment $m_3 = 3.402$. For this specific
mixture, the theoretical mean lifetime is $\mu = .805$ and the theoretical variance
is $\sigma^2 = .767$. Hence the simulated data has reasonable agreement with the
theory.
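
The theoretical mean and variance quoted in Example 1 follow from the first two moments of an exponential mixture, $E[X] = \sum_i \alpha_i / \lambda_i$ and $E[X^2] = \sum_i 2\alpha_i / \lambda_i^2$. The sketch below recomputes them; the function name is an assumption made for illustration.

```python
def exponential_mixture_mean_variance(weights, rates):
    """Mean and variance of a mixture of exponential distributions."""
    mean = sum(a / lam for a, lam in zip(weights, rates))
    second_moment = sum(2 * a / lam ** 2 for a, lam in zip(weights, rates))
    return mean, second_moment - mean ** 2


# Mixture of Example 1: weights .61 and .39, intensities 1 and 2
MEAN, VARIANCE = exponential_mixture_mean_variance([0.61, 0.39], [1.0, 2.0])
```

Rounded to three decimals, the results agree with the values .805 and .767 stated in the text.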

For Example 1, we would assume that $\lambda_1 = 1$ is known and that $\lambda_2$ is
unknown. The natural null hypothesis $H_0$ is $\alpha = 1$, that is, that there
is no mixing. If this is true, then the asymptotic value for the expected
number of edges is 448.5. Similarly, the asymptotic value for the variance of
the number of edges under the null hypothesis is 594.01 giving a standard
deviation of 24.4. One notes that the observed number of edges is sufficiently
large that the null hypothesis is untenable and hence the null hypothesis is
rejected. For the specific mixture employed, the expected number of edges is
587.8. Thus, it is clear that for Example 1, detection of a mixture, using the
number of edges, can be accomplished. Should one wish to estimate $\alpha$ and $\lambda_2$
from the data, there are several procedures that one might use. Typically,
one would utilize the lifetime data and employ one of the standard statistical
estimation techniques, such as the method of maximum likelihood or the
method of moments. Because of its computational simplicity, the method of
moments has been utilized, resulting in the estimates $\hat{\alpha} = .25$ and $\hat{\lambda}_2 = 1.29$.
These values are compatible with the data, since the expected number of
edges for the mixture is somewhat greater than the actual number of edges
obtained.
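
The expected edge counts quoted for Example 1 can be recomputed from the asymptotic formula for $m = 2$, namely $\binom{n}{2} \cdot 2d \int f^2(x)\,dx$, where for a mixture of exponentials $\int f^2(x)\,dx = \sum_{i,j} \alpha_i \alpha_j \lambda_i \lambda_j / (\lambda_i + \lambda_j)$. The following sketch is an illustration only: the function names are hypothetical, and the variance under the null hypothesis is taken from the text rather than derived.

```python
from math import comb, sqrt


def integral_of_squared_density(weights, rates):
    """Integral of f(x)^2 over x > 0 for f(x) = sum_i a_i * lam_i * exp(-lam_i * x)."""
    pairs = list(zip(weights, rates))
    return sum(ai * aj * li * lj / (li + lj) for ai, li in pairs for aj, lj in pairs)


def expected_edges(n, d, weights, rates):
    """Asymptotic expected number of edges: C(n, 2) * 2d * integral of f^2."""
    return comb(n, 2) * 2 * d * integral_of_squared_density(weights, rates)


N_OBS, D, OBSERVED = 300, 0.01, 563
NULL_MEAN = expected_edges(N_OBS, D, [1.0], [1.0])                  # no mixing
MIXTURE_MEAN = expected_edges(N_OBS, D, [0.61, 0.39], [1.0, 2.0])   # simulated mixture
NULL_VARIANCE = 594.01                                              # value stated in the text
Z = (OBSERVED - NULL_MEAN) / sqrt(NULL_VARIANCE)
```

The two means reproduce the values 448.5 and 587.8 of the text, and the observed count lies more than four standard deviations above the null mean, which is why the null hypothesis is rejected.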



Example 2. In this example, data has been obtained which may be from
a mixture of two exponential distributions, but in contrast to Example 1,
none of the putative exponential parameters is known. Hence we wish to
determine if the data is from a single exponential distribution or from a
mixture of two exponential distributions. To illustrate, we will use the same
data as in Example 1. Since our intent is to demonstrate the use of these
combinatorial techniques, we equate the number of edges observed to the
expected number of edges and solve for the exponential parameter $\lambda$, obtaining
$\lambda = 1.255$. Using the lifetime data, the maximum likelihood estimate under
this assumption is $\hat{\lambda} = 1.204$. If $\lambda = 1.204$, then the expected number of
edges is 540.0 and the variance of this number is 861.1 giving a standard
deviation of 29.3, and there is insufficient evidence using the number of edges
to support the presence of a non-trivial mixture.
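
The numbers of Example 2 can be retraced in the same way. For a single exponential distribution $\int f^2(x)\,dx = \lambda/2$, so the asymptotic expected number of edges is $\binom{n}{2} d \lambda$. The sketch below is illustrative: the variable names are assumptions, and the maximum likelihood estimate and the variance are taken from the text.

```python
from math import comb, sqrt

N_OBS, D, OBSERVED = 300, 0.01, 563

# Expected number of edges for lambda = 1; it scales linearly in lambda
EDGES_PER_UNIT_RATE = comb(N_OBS, 2) * D

# Equate observed and expected number of edges and solve for lambda
LAMBDA_FROM_EDGES = OBSERVED / EDGES_PER_UNIT_RATE

LAMBDA_ML = 1.204        # maximum likelihood estimate stated in the text
VARIANCE_ML = 861.1      # variance of the edge count stated in the text
EXPECTED_ML = EDGES_PER_UNIT_RATE * LAMBDA_ML
Z = (OBSERVED - EXPECTED_ML) / sqrt(VARIANCE_ML)
```

The observed count is less than one standard deviation above the expected count, consistent with the conclusion that the edge count alone gives insufficient evidence for a mixture here.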

If we postulate that the data is in fact from a mixture of two exponential
distributions, then using the lifetime data and the method of moments, the
following parameter estimates are obtained: $\hat{\alpha} = .54$ for the mixing
parameter, $\hat{\lambda}_1 = 1.09$, and $\hat{\lambda}_2 = 1.38$, which is compatible with
the estimate 1.204 obtained under the assumption of homogeneity.



8 Concluding Remarks 

The authors are continuing their investigation of these and related meth- 
ods. At present, an extensive examination of the method introduced by 
R.A. Fisher and, in particular, an analysis of the famous iris data (see 
Fisher (1936), Fisher (1938), Fisher (1940)) is under way to provide fur- 
ther information about the efficacy of these methods and to provide useful 
comparisons with other techniques. 
Abstract: For some reinforcement learning algorithms the optimality of the
generated strategies can be proven. In practice, however, restrictions in the number of
training examples and computational resources corrupt optimality. The efficiency 
of the algorithms depends strikingly on the formulation of the task, including the 
choice of the learning parameters and the representation of the system states. We 
propose here to improve the learning efficiency by an adaptive classification of 
the system states which tends to group together states if they are similar and
acquire the same action during learning. The approach is illustrated by two simple
examples. Two further applications serve as a test of the proposed algorithm. 



1 Introduction 

Reinforcement learning algorithms (Sutton (1988), Werbos (1990)) have 
shown promising perspectives, e.g. in control, game playing and scheduling 
tasks. They are particularly suitable in problems where the overall perfor- 
mance can be evaluated, but detailed knowledge about efficient strategies 
is not available. The general idea is that a randomly selected action, say a 
move at a given game configuration, is reinforced if it is successful and is, 
hence, used more frequently in similar situations in future. Reinforcement 
learning assigns actions to system states, which may be quite numerous
even in toy examples. In order to achieve low computational complexity as
well as faster convergence of the learning algorithm, it is therefore advisable
to group states into classes and let learning determine one appropriate
action for all states in that class. Ideally, states from which the same sequence
of actions is a good strategy should form a class, but information about
strategies cannot be assumed to be available before learning.
One may resort to statistical classifications as provided by
vector quantization algorithms. In this case an action is selected for each 
class by an intra-class average of received reinforcement, which is suboptimal 
compared to the action-uniformity of an ideal classification. Further, a sta- 
tistical classification may become misleading as the state frequency depends 
via the selected actions on the course of reinforcement learning. 

The algorithm presented in this contribution was designed to resolve the 
dilemma between inefficient statistical classification on the one hand and 
missing information for behavior-dependent classification on the other hand.
Clearly, one has to start with the occurring states and thus relates implicitly to
the state statistics. In the course of action learning, however, the initial clas- 
sification will be modified by including information about strategies, which 
is more efficient with respect to the performance of the whole system and 
also reduces the sensitivity to changes of the state statistics. 

The optimality criterion applied here consists in minimizing the ambiguity 
of the state classes produced by the vector quantizer, i.e. state classes are 
chosen to be such that a single action is appropriate for any state in a class. 
We will illustrate the problem and its solution, resp., by two simple tasks, 
a positioning (section 2) and a pole-balancing problem (section 3). More 
realistic problems are discussed in section 4. 



2 Reinforcement Learning 

2.1 Q-learning 

A common implementation of the reinforcement learning paradigm is the 
Q-learning algorithm (Watkins (1992)). An introduction to this and re- 
lated algorithms is given in the recent book by Sutton and Barto (1998). 
Q-learning is a model-free method which, similarly to dynamic optimization,
is supposed to assign action sequences to state sequences such that
a goal is reached or an evaluation function is maximized.

Optimal action sequences $\{a(t)\}$ are determined relying on reinforcement
signals $r(t)$. From $r(t)$ the algorithm constructs a quality function $Q(\xi, a)$ that
contains information on the suitability of actions performed in the present
state $\xi(t)$ to frequently reach states with high reinforcement. Using an
appropriate discretization of the generally continuous state space S the states
$\xi$ can be represented by a finite number of state classes j. Classes are
formed by partitioning S, which will be the subject of the following section.
Given $j(t)$, an action $a(t)$ to be applied is chosen according to

$$a(t) = \arg\max_a Q(j(t), a) \qquad (1)$$

The Q-function is adjusted one time step later using $r(t)$ and the maximal value
$V(j(t))$ of the Q-function in the next step with a discount factor $\gamma < 1$, i.e.

$$V(j(t)) = \max_a Q(j(t), a), \qquad (2)$$

$$\Delta Q(j(t-1), a(t-1)) = \varepsilon \left( r(t) + \gamma V(j(t)) - Q(j(t-1), a(t-1)) \right) \qquad (3)$$

The Q-function thus approximates the expected future reward on carrying 
out an action in the present state under the assumption that the currently 
learned strategy is followed thereafter. If $\gamma < 1$ the future reward is
discounted, i.e. a reward of r enters the Q-function reduced by a factor of $\gamma^k$
if it is received only k time steps later. During the learning process rule (1)
is applied only with a certain probability and randomly selected actions are 
used otherwise for a further exploration of the state-action space. 
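
The rules (1)-(3) together with the random exploration step can be sketched as a look-up table update. This is only a minimal illustration, not the authors' implementation: the function names, the two-element action set `ACTIONS`, and the default values of the learning rate `eps`, the discount `gamma` and the exploration probability `explore` are assumptions made for the example.

```python
import random
from collections import defaultdict

ACTIONS = (-1, +1)  # assumed two-action set, e.g. negative/positive force


def greedy_action(Q, j):
    """Rule (1): action with the largest Q-value in state class j."""
    return max(ACTIONS, key=lambda a: Q[(j, a)])


def select_action(Q, j, explore=0.1, rng=random):
    """Apply rule (1) with probability 1 - explore, otherwise act randomly."""
    if rng.random() < explore:
        return rng.choice(ACTIONS)
    return greedy_action(Q, j)


def q_update(Q, j_prev, a_prev, r, j_now, eps=0.1, gamma=0.9):
    """Rules (2) and (3): shift Q(j_prev, a_prev) towards r + gamma * V(j_now)."""
    v_now = max(Q[(j_now, a)] for a in ACTIONS)                    # Eq. (2)
    Q[(j_prev, a_prev)] += eps * (r + gamma * v_now - Q[(j_prev, a_prev)])  # Eq. (3)


# Look-up table with all entries initially zero (hypothetical initialization).
Q = defaultdict(float)
q_update(Q, j_prev=0, a_prev=+1, r=1.0, j_now=1)
```

A dictionary keyed by (class, action) pairs keeps the sketch independent of the number of state classes, which is convenient when the partition itself changes during learning.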

In order to prove that an optimal strategy is found by Q-learning the sets of 
possible actions and rewards were assumed to be finite (cf. Dayan (1992)). 
Here a finite action set is used, which also allows for a simple look-up table 
representation of the Q-function. Concerning the reinforcement signal, bi- 
nary values are not necessary, but in many applications more information is 
not available. In the tasks considered below the reinforcement signal r{t) is 
zero except if the state has reached the target or follows a target behavior. 
Figure 1: Cart-centering problem. State-action correspondence and example
trajectories produced by Q-learning with a fixed state space partition. Cart
position is displayed in horizontal, velocity in vertical direction. Braces
identify the target area, where upon arrival of the cart positive reinforcement is
given. The acceleration of a cart is determined by the action assigned to the
closest reference point: o: $a = a_0$, +: $a = -a_0$.



2.2 An Example 

As a simple example we consider a positioning task. A cart on a track is 
accelerated by a force of constant strength acting in either of the directions 
of the track. Positive or negative acceleration is to be applied such that the 
cart arrives at the center of the track with zero velocity from a given position 
in minimal time. For a friction-less motion the theoretical solution of the 
problem is expressed by the parabola branches in Fig. 1. Optimally the cart
is accelerated towards the center until it reaches a velocity corresponding to 
the parabola. Then, the force is inverted such that the cart follows the curve 
with decreasing speed, which becomes zero at the target position. By and 
large the Q-learning algorithm based on a regular quantization of the speed- 
position vector of the cart reproduces the theoretical solution. Although the 
optimal partition into two regions corresponding to positive and negative 
force, resp., is rather simple, the learning time of the algorithm is excessively 
high. On the other hand a partition as used in Fig. 1 will hardly work with
fewer units. This example demonstrates the need for efficient partitions that
are optimized with respect to the performance of the control algorithm rather 
than with respect to statistical properties of the state distribution (which 
is modified by the learning of strategies) or to “natural” assumptions about 
the homogeneity of the state space. 
Figure 2: Illustrating a statistically optimal, an efficient and an ideal
partition of a state space.

Figure 3: Reinforcement learning with success-dependent partition of the
state space.



3 State Space Partitioning

3.1 Data Representation by Unsupervised Neural 
Nets 

In the present approach a piecewise constant approximation of the “ideal”
Q-function is provided by a neighborhood-conserving vector quantizer. In 
this way a faster organization of the partition and reorganization for a 
changing input distribution can be achieved and “dead” (unused) units are 
avoided. Moreover in the present setting the neighborhood interaction is 
modified in order to enforce uniformity with respect to optimal actions, 
cf. 3.2. 

Unsupervised neural network algorithms are well suited for this purpose. 
The network receives the current system state $\xi \in \Sigma$ as input. $\Sigma$ is
assumed to be bounded. The probability density $P(\xi)$ should change only
slowly and should remain concentrated inside $\Sigma$. The index of the maximally
activated output neuron is used as the class $j(t)$ in Q-learning. Among the
available algorithms we favored the neural gas algorithm (Martinetz (1991)) 
rather than a self-organizing map (Kohonen (1995)) in order to avoid draw- 
backs of a prescribed topology of reference units. The basic update rule 
(Martinetz (1991)) for a class reference vector $w_j \in \Sigma$ representing the
prototype state of class j reads

$$\Delta w_j = -\eta \exp\left(-k_j(\xi, w_1, \ldots, w_N)/\lambda_j\right) (w_j - \xi), \qquad (4)$$

where $\eta$ is a small learning rate. Topological relations between an input datum
$\xi$ and the set of reference vectors w are introduced in Eq. (4) by the
exponential, which depends on the rank $k_j$ of the Euclidean distance between $w_j$ and
$\xi$ among all such distances. In particular, for $\lambda_j \to 0$ only the unit j closest
to $\xi$ will be influenced by the update, whereas for larger $\lambda_j$ other neurons,
too, are attracted towards $\xi$. In contrast to the original formulation
(Martinetz (1991)) here $\lambda_j$ may vary across units. The resulting asymmetrical
interaction allows reference vectors to be moved to regions of $\Sigma$ which are only
poorly represented.
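
One adaptation step of Eq. (4) with unit-specific ranges can be sketched as follows. This is a hedged illustration only: the function name `neural_gas_step`, the list-of-lists representation and the default learning rate are assumptions, the rank of the closest unit is taken to be 0, and every $\lambda_j$ is assumed to be strictly positive so that the exponent is defined.

```python
import math


def neural_gas_step(w, xi, lam, eta=0.05):
    """One update of all reference vectors w towards the input xi (Eq. 4).

    w   : list of reference vectors (lists of floats), modified in place
    xi  : input state (list of floats)
    lam : per-unit neighborhood ranges lambda_j, all assumed > 0
    Returns the index of the closest unit, used as the state class j(t).
    """
    dist = [math.dist(wj, xi) for wj in w]
    order = sorted(range(len(w)), key=lambda j: dist[j])
    rank = {j: k for k, j in enumerate(order)}  # closest unit has rank 0
    for j, wj in enumerate(w):
        h = math.exp(-rank[j] / lam[j])         # rank-based neighborhood factor
        w[j] = [wc - eta * h * (wc - xc) for wc, xc in zip(wj, xi)]
    return order[0]


w = [[0.0, 0.0], [1.0, 1.0]]
winner = neural_gas_step(w, xi=[0.2, 0.0], lam=[1.0, 1.0], eta=0.5)
```

Because the range enters per unit, a unit with small $\lambda_j$ moves almost only when it is the winner, while a unit with large $\lambda_j$ is drawn towards inputs that other units have won.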

3.2 Partitioning in a Self-supervised Architecture 

Usually the resulting partitions are dependent only on statistical proper- 
ties of the input data, which is not necessarily efficient in Q-learning. The 
construction of a partition, which depends on the performance of the follow- 
ing Q-learning module, is an example of the so-called hidden state problem 
(cf. e.g. Daz, Moser (1994)), which has been approached in (Herrmann, Der 
(1995)) by an optimal “division of labor” among the state classes. Classes 
that learn successfully are increased, whereas classes that cannot unambigu- 
ously select an action attract neighboring classes to share local work. In 
this way, regions of the state space with a more complex learning task are 
resolved more finely. 







The maximal value of the Q-function at a given unit decides which action 
is to be executed at the next time step. In the ideal partitioning defined in 
the introduction, there is only one optimal action in each domain so that 
in the converged state the Q-function is stationary under the learning rule 
(3). Hence, the fluctuations of Q in a class j of the state space may serve
as an indicator for the reliability of this class. When monitoring Q-values
it becomes obvious which domains have already decided for an action.
Otherwise, the unit ‘requests assistance’ from its neighbors by means of an
increased $\lambda_j$-value (cf. Eq. 4). In the case of two possible actions denoted
by $a(t) = \pm 1$ we define the firmness $f \in [0, 1]$ of a unit j as

$$f_j(t) = \left| \langle a(t) \rangle_{\tau,\, j(t)=j} \right| \qquad (5)$$

$\langle \ldots \rangle$ denotes a moving average with time constant $\tau$ over those time steps
where unit j was activated and the action was chosen according to (1).
All firmness values are initially zero and the presently available estimate is
used for modifying the parameters $\lambda_j$ of the neural gas quantizer. $\tau^{-1}$ is
proportional to the learning rate $\varepsilon$ in (3). Other definitions of the firmness
are possible. In particular, in the case of more than two actions, entropy
measures based on the probability of success for each action suggest
themselves instead of (5). The choice

$$\lambda_j(t) \propto (1 - f_j(t))^2 \qquad (6)$$

is not only distinguished by its simplicity, but also has the advantage of
suppressing quadratically small deviations from maximal firmness, $f_j = 1$,
which are unavoidable when using a nearest neighbor classifier to approxi- 
mate the smooth boundaries in the control problem. When neighborhood 
interaction is defined in this way, firm units will not disturb their neigh- 
bors whereas non-firm units will effectively attract firm units such that their 
domain shrinks and reliable control becomes possible, cf. Fig. 5. 
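
The firmness (5) and the resulting neighborhood range (6) can be tracked per unit with an exponential moving average. The sketch below rests on assumptions: the class name `Firmness`, the proportionality constant `lam_max` and the floor `lam_min` (added only so that the range stays positive) are hypothetical and not taken from the text.

```python
class Firmness:
    """Per-unit firmness f_j (Eq. 5) and neighborhood range lambda_j (Eq. 6)."""

    def __init__(self, n_units, tau=20.0, lam_max=2.0, lam_min=1e-3):
        self.avg = [0.0] * n_units  # moving average of greedy actions a = +1 / -1
        self.tau = tau              # time constant of the moving average
        self.lam_max = lam_max      # assumed proportionality constant in Eq. (6)
        self.lam_min = lam_min      # assumed floor keeping lambda_j > 0

    def observe(self, j, a):
        """Record a greedy action a in {-1, +1} taken while unit j was active."""
        self.avg[j] += (a - self.avg[j]) / self.tau

    def firmness(self, j):
        return abs(self.avg[j])                                   # Eq. (5)

    def lam(self, j):
        return self.lam_min + self.lam_max * (1.0 - self.firmness(j)) ** 2  # Eq. (6)


f = Firmness(n_units=2, tau=2.0, lam_min=0.0)
f.observe(0, +1)   # unit 0 starts to commit to action +1
```

A unit whose greedy action keeps flipping sign stays near zero firmness and therefore keeps a large range, which is the "request for assistance" described above.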




Figure 4: Pole balancing task. Displayed is the success rate as a function
of learning time. Reinforcement learning with a success-dependent partition
(upper curve) and with a fixed partition into 6 x 3 x 3 x 3 boxes (lower
curve).

Figure 5: Time course of the mean firmness value. After about 140000 time
steps control is such that only unambiguous classes are visited.

In addition to the example in section 2.2 the combined algorithm (cf. Fig. 3)
has been applied to the cart-pole problem, where a pole is to be balanced by
moving a cart along a track of fixed length (for details cf. Pendrith 1994).
Fig. 4 displays the achieved improvement of learning time and success rate
as compared to the original Q-learning algorithm with the same number of
state classes.



4 Applications 

4.1 Chaos Control 

The attractor of a chaotic system can be characterized by a multitude of 
unstable periodic trajectories. Because of the sensitive dependency of the 
system state on initial conditions long-term predictions are impossible. On 
the other hand, this fact makes it possible to effect large changes of the behavior by
tiny control actions imposed on the system. Even if the properties of the
system are not explicitly known, it is possible to stabilize unstable orbits that
satisfy certain conditions, e.g. having a fixed periodicity. A reinforcement learning
algorithm has been applied to the problem of chaos control (Der, Herrmann
(1994)). A similar approach to this task which was based on an unsupervised 
neural network has been presented by Funke et al. (1997). 



4.2 Autonomous Vehicles 

Reinforcement learning is of particular interest in robotics. In order to gener- 
ate meaningful behavior in an unknown complex environment, an appropri- 
ate representation of the environment is necessary. This representation can 
consist in a neural map of the space of possible sensory readings. We have 
performed experiments on a Khepera miniature robot which is equipped with 
infrared sensors that are activated when the robot approaches an obstacle.
Actions are motor commands to the wheels of the robot. We have chosen
the reinforcement signal to be r = 1 for free motion of the robot and r = -1
if the robot tries to move forward when facing an obstacle. The sensory
space has been partitioned into 32 classes according to the scheme described
above. By the Q-learning algorithm the robot learns an effective obstacle
avoidance behavior within 30 minutes or equivalently 50000 learning steps. 
In order to solve more complex tasks the representation of the environment 
should involve also higher-level features. In a forthcoming paper we have 
studied the extraction of positional information from sensory input based 
on a fixed elementary obstacle avoidance behavior, which is also realized in 
a Khepera miniature robot. 
Abstract: We present some results devoted to the existence, numbers and orders
of strong connected clusters and cliques in random digraphs and their match- 
graphs. These results may provide an interesting sociological interpretation. 



1 Introduction 

In sociometry, a so-called sociogram is often used to describe the structure of 
a sociological group. A sociogram is a digraph where the vertices represent 
the members of the group and arcs reflect how the members communicate 
between themselves. For example, each member can be asked to name a 
specific number of others in the group with whom they would like to interact.
The standard instruction is often stated in the form "name your d best
friends". It may occur - and it often does - that two distinct members
choose each other. A pair of such members is called a match. In a case 
when two members do not choose each other we say that they form an 
independent pair. In sociometric investigations the total number of matches 
or independent pairs is often used as a measure of sociological cohesion. 
Obviously, whenever $d \ge 2$, matches are not necessarily disjoint. So it
is interesting to consider subsets of the group, so-called strong connected
clusters, such that each pair of their members is a match. A strong connected
cluster which is not properly contained in any other strong connected cluster 
is called a clique of the group. 

When a group is relatively new, it will be less cohesive and the sociogram 
will have a random aspect. Therefore we can assume that each member of 
a group consisting of n people names independently some other members 
at random. In other words, the structure of a sociological group can be 
represented by a random digraph $D = D(n, \mathcal{P})$ defined as follows.

Let $\mathcal{P} = (P_0, P_1, \ldots, P_{n-1})$ be a probability distribution, i.e. an n-tuple of
non-negative real numbers which satisfy

$$P_0 + P_1 + \cdots + P_{n-1} = 1 .$$

Denote by $D(n, \mathcal{P})$ a random digraph on a vertex set $V = \{1, 2, \ldots, n\}$ such
that (here, and in what follows, $N^+(i)$ denotes the set of images of a vertex i):

1) each vertex $i \in V$ "chooses" its out-degree according to the probability
distribution $\mathcal{P}$, i.e.

$$\Pr\{|N^+(i)| = k\} = P_k, \qquad k = 0, 1, \ldots, n-1 ;$$

2) for every $S \subseteq V \setminus \{i\}$, with $|S| = k$, the probability that S coincides
with the set of images of a vertex i equals

$$\Pr\{N^+(i) = S\} = P_k \Big/ \binom{n-1}{k} ,$$

i.e. vertex i "chooses" uniformly the set of images;

3) each vertex "chooses" its out-degree and then its images independently
of all other vertices.

This model has been introduced in Jaworski, Smit (1987). Several properties
of the random digraph $D(n, \mathcal{P})$ were discussed in Jaworski, Palka (1997) and
Jaworski, Palka (1998).

In this paper we will study a match-graph of a digraph $D(n, \mathcal{P})$, which is
defined as follows. Replace each two-cycle (a match) of $D(n, \mathcal{P})$ by an edge
and omit the remaining arcs. Let us denote such a model by $M(n, \mathcal{P})$. In
the case when $\mathcal{P}$ is a binomial distribution $B(n-1, q)$ one can check that
the model $M(n, \mathcal{P})$ is equivalent to a random graph $M(n, B)$ on n labeled
vertices in which each of the $\binom{n}{2}$ possible edges appears independently with a
given probability $p = q^2$, i.e. to a classical random graph $G(n, p)$.
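
The construction of a match-graph can be illustrated by sampling the out-regular special case, in which every vertex names exactly d others. The sketch is an assumption-laden example rather than part of the paper: the function name `sample_match_graph`, the vertex labels 0, ..., n-1 and the use of Python's `random.Random` are choices made here.

```python
import random
from itertools import combinations


def sample_match_graph(n, d, rng):
    """Draw an out-regular random digraph and return the edges of its match-graph.

    Every vertex i chooses d distinct images uniformly among the other n - 1
    vertices, independently of all other vertices. An edge {i, j} is kept only
    if i chose j and j chose i (a match); all remaining arcs are omitted.
    """
    images = {
        i: set(rng.sample([v for v in range(n) if v != i], d))
        for i in range(n)
    }
    return {
        (i, j)
        for i, j in combinations(range(n), 2)
        if j in images[i] and i in images[j]
    }


edges = sample_match_graph(n=50, d=10, rng=random.Random(1))
```

Averaging `len(edges)` over many draws gives an empirical estimate of the expected number of matches, which can be compared with the exact moment formulas of the next section.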

Our main interest here is to investigate complete subgraphs and cliques of
$M(n, \mathcal{P})$. By a clique we mean a complete subgraph which is not properly
contained in any other complete subgraph. Clearly a complete subgraph of
$M(n, \mathcal{P})$ corresponds to a complete subgraph of $D(n, \mathcal{P})$ in which each pair
of vertices is joined by a two-cycle.

The above graph theoretic definitions of a complete subgraph and clique of a
digraph $D(n, \mathcal{P})$ coincide with the notion of a strong connected cluster and
clique of a social group.

This paper is divided into two parts. The first part (Section 2) contains exact
formulas for the first two moments of the number of complete subgraphs and
for the expected value of the number of cliques in a random match-graph
$M(n, \mathcal{P})$. Keeping in mind a relation between a "new" social group and a
random digraph $D(n, \mathcal{P})$, our considerations provide interesting sociological
interpretations.

In the second part (Section 3) we apply results from Section 2 to obtain
asymptotic results about the existence, orders and numbers of complete
subgraphs in a random match-graph of an out-regular digraph $D(n, d)$. By such
a digraph we mean a very special case of $D(n, \mathcal{P})$, namely when $\mathcal{P}$ is a
degenerate distribution, i.e. $P_d = 1$ for some d, $1 \le d \le n-1$. A
match-graph of $D(n, d)$ will consequently be denoted by $M(n, d)$.



2 Complete Subgraphs of $M(n, \mathcal{P})$

Let $X_r$ and $Y_r$ stand for the number of complete subgraphs and cliques on r
vertices in a random match-graph $M(n, \mathcal{P})$, respectively. We give here the
exact formulas for the first two moments of $X_r$ and for the expectation of
$Y_r$. In particular, these moments are useful in investigations of the structure
of small social groups.

Let $X^+(i)$ be a discrete random variable having the probability distribution
$\mathcal{P}$, i.e.,

$$\Pr\{X^+(i) = k\} = P_k, \qquad k = 0, 1, \ldots, n-1 ,$$

which defines the out-degree of a given vertex i of a random digraph $D(n, \mathcal{P})$.
Actually, we consider n independent, identically distributed random
variables $X^+(1), X^+(2), \ldots, X^+(n)$. Therefore we can write shortly $X^+$ instead
of $X^+(i)$.

We will need the probability that a given subset of vertices is contained in
the set of images of a vertex $i \in V = \{1, 2, \ldots, n\}$.

As usual, $E_k(X)$ stands for the k-th factorial moment of a random variable
X. Also, we write $(n)_k = n(n-1) \cdots (n-k+1)$. One can show (see Jaworski,
Palka (1998))

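Property 2.1 For a given i, $1 \le i \le n$, let $U \subseteq V \setminus \{i\}$ and $|U| = t \ge 1$.
Then

$$\Pr\{U \subseteq N^+(i)\} = \frac{E_t(X^+)}{(n-1)_t} . \qquad \Box$$

First let us consider moments of the random variable $X_r$. Let U be a subset
of V and $I_U$ be an indicator random variable defined as follows:

$$I_U = \begin{cases} 1, & \text{if } U \text{ spans a complete subgraph in } M(n, \mathcal{P}) \\ 0, & \text{otherwise.} \end{cases}$$

Then for any subset $U \subseteq V$, $|U| = r$, by the above property

$$\Pr\{I_U = 1\} = \prod_{i \in U} \Pr\{U \setminus \{i\} \subseteq N^+(i)\} = \left[ \frac{E_{r-1}(X^+)}{(n-1)_{r-1}} \right]^r .$$

Similarly for any given pair of subsets U, W

$$\Pr\{I_U = I_W = 1\} = \prod_{i \in U \cap W} \Pr\{U \cup W \setminus \{i\} \subseteq N^+(i)\}
\cdot \prod_{i \in U \setminus W} \Pr\{U \setminus \{i\} \subseteq N^+(i)\}
\cdot \prod_{i \in W \setminus U} \Pr\{W \setminus \{i\} \subseteq N^+(i)\} .$$

Hence again by our property, if $|U| = |W| = r$ and $|U \cap W| = j$ where
$j = 0, 1, \ldots, r$, then

$$\Pr\{I_U = I_W = 1\} = \left[ \frac{E_{2r-j-1}(X^+)}{(n-1)_{2r-j-1}} \right]^j \left[ \frac{E_{r-1}(X^+)}{(n-1)_{r-1}} \right]^{2(r-j)} .$$

Since $X_r$ is a sum of indicators $I_U$ over all r-element subsets $U \subseteq V$, thus

$$E(X_r) = \binom{n}{r} \left[ \frac{E_{r-1}(X^+)}{(n-1)_{r-1}} \right]^r . \qquad (2.1)$$

Similarly since

$$E(X_r^2) = \binom{n}{r} \sum_{j=0}^{r} \binom{r}{j} \binom{n-r}{r-j} \Pr\{I_U = I_W = 1\} ,$$

we have also

$$E(X_r^2) = E(X_r)^2 \sum_{j=0}^{r} \frac{\binom{r}{j}\binom{n-r}{r-j}}{\binom{n}{r}} \left[ \frac{E_{2r-j-1}(X^+)}{E_{r-1}^2(X^+)} \cdot \frac{(n-1)_{r-1}}{(n-r)_{r-j}} \right]^j . \qquad (2.2)$$

Now let us derive the exact formula for the expected value of $Y_r$, the number
of r-cliques in a random match-graph $M(n, \mathcal{P})$. Assume that U is a given
r-element subset of V. For a given vertex $x \in V \setminus U$, let $A_x$ be the event that
there are edges between x and all vertices from U. Also, let

$$J_U = \begin{cases} 1, & \text{if } U \text{ spans a clique in } M(n, \mathcal{P}) \\ 0, & \text{otherwise.} \end{cases}$$

Then

$$\Pr\{J_U = 1\} = \Pr\{I_U = 1\} \, \Pr\Big\{ \bigcap_{x \in V \setminus U} \bar{A}_x \,\Big|\, I_U = 1 \Big\} .$$

Now, for each $T \subseteq V \setminus U$, where $|T| = k$ and $1 \le k \le n - r$, we have

$$P(T) = \Pr\Big\{ \bigcap_{x \in T} A_x \,\Big|\, I_U = 1 \Big\}
= \left( \Pr\{I_U = 1\} \right)^{-1} \cdot \prod_{x \in T} \Pr\{U \subseteq N^+(x)\} \cdot \prod_{i \in U} \Pr\{T \cup U \setminus \{i\} \subseteq N^+(i)\}$$

$$= \left[ \frac{E_r(X^+)}{(n-1)_r} \right]^k \left[ \frac{E_{k+r-1}(X^+)}{E_{r-1}(X^+)} \cdot \frac{1}{(n-r)_k} \right]^r .$$

Consequently, assuming that $P(\emptyset) = 1$ and using the principle of inclusion
and exclusion we obtain

$$E(Y_r) = \sum_{U \subseteq V,\, |U| = r} \Pr\{I_U = 1\} \sum_{k=0}^{n-r} (-1)^k \sum_{T \subseteq V \setminus U,\, |T| = k} P(T) .$$

Hence

$$E(Y_r) = E(X_r) \sum_{k=0}^{n-r} (-1)^k S_k , \qquad (2.3)$$

where

$$S_k = \binom{n-r}{k} \left[ \frac{E_r(X^+)}{(n-1)_r} \right]^k \left[ \frac{E_{k+r-1}(X^+)}{E_{r-1}(X^+)} \cdot \frac{1}{(n-r)_k} \right]^r . \qquad (2.4)$$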



We do not give here $E(Y_r^2)$ since its form is more complicated and in
asymptotic investigations we need only formulas for $E(X_r)$ and $E(Y_r)$. Actually
Jaworski, Palka (1997) proved the formula for all factorial moments of a
random variable that counts subgraphs of a given type in the match-graph.



3 Applications to Out-regular Digraphs 

In this section we apply the exact results obtained in the first part of our 
paper to a match-graph M(n, d). 

Before we do this, let us mention a relation between M(n, d) and a classical
random graph G(n, p). It turns out that in the case when

$$p = \left( \frac{d}{n} \right)^2$$

results related to the existence and the number of subgraphs of a given type
are asymptotically the same in these two models. This is not surprising,
since the probability that there is an edge between two given vertices in
M(n, d) equals $(d/(n-1))^2$.

The main difference between random graphs G(n, p) and M(n, d) is that in
the first model each edge occurs independently of all other edges whereas in
M(n, d) the edges are not independent. However, it turns out
that the dependence between edges on specified finite sets of vertices in M(n, d)
is very small when n is large.

Now having in mind formulas (2.1), (2.2) and (2.3) one can immediately
obtain corresponding results in a match-graph M(n, d), since in this case
the r-th factorial moment of the random variable $X^+$ has a very simple
form

$$E_r(X^+) = (d)_r . \qquad (3.1)$$

To illustrate the behavior of the expected number of complete subgraphs
and cliques of a given order, let us consider a random match-graph M(n, d) on
n = 50 vertices. Table 1 contains values of $E(X_r)$ (the left column) and
$E(Y_r)$ (the right column) with respect to different values of d (the sign "-"
indicates that the respective value is less than 0.01). The analysis of Table 1
shows interesting properties of orders of complete subgraphs and cliques in a
random match-graph M(50, d). Let us remark here, for example, that one cannot
expect to find a clique of order less than 5 or greater than 10 in M(50, 40).
It means that in this case the order of the smallest clique is half of the
order of the largest clique, which obviously coincides with the order of the
largest complete subgraph.
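
The expectations for the out-regular case follow from (2.1), (2.3) and (2.4) with $E_t(X^+) = (d)_t$ and can be evaluated exactly. The sketch below is illustrative only: the function names are hypothetical, and exact rational arithmetic is an implementation choice made here because the alternating sum in (2.3) is prone to cancellation in floating point.

```python
from fractions import Fraction
from math import comb


def falling(x, k):
    """Falling factorial (x)_k = x (x - 1) ... (x - k + 1), with (x)_0 = 1."""
    out = 1
    for i in range(k):
        out *= x - i
    return out


def expected_complete(n, d, r):
    """E(X_r) in M(n, d): formula (2.1) with E_t(X+) = (d)_t."""
    return comb(n, r) * Fraction(falling(d, r - 1), falling(n - 1, r - 1)) ** r


def expected_cliques(n, d, r):
    """E(Y_r) in M(n, d): formulas (2.3) and (2.4) with E_t(X+) = (d)_t."""
    if falling(d, r - 1) == 0:      # no complete subgraph of order r can exist
        return Fraction(0)
    total = Fraction(0)
    for k in range(n - r + 1):
        s_k = (
            comb(n - r, k)
            * Fraction(falling(d, r), falling(n - 1, r)) ** k
            * Fraction(falling(d, k + r - 1),
                       falling(d, r - 1) * falling(n - r, k)) ** r
        )
        total += (-1) ** k * s_k
    return expected_complete(n, d, r) * total


ex2 = float(expected_complete(50, 10, 2))   # about 51.02
ey2 = float(expected_cliques(50, 10, 2))    # about 47.81
```

For n = 50, d = 10 and r = 2 these values round to the entries 51.0 and 47.8 reported for M(50, 10).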



Table 1: Complete subgraphs and cliques in M(50,d)

Order        d = 10             d = 20             d = 30              d = 40
 (r)      EXr     EYr        EXr     EYr        EXr     EYr         EXr      EYr
  2       51.0    47.8      204.1    56.6      495.2     -         816.3      -
  3        1.1     1.1       82.7    70.1      991.9    93.5      5718.9      -
  4         -       -         3.4     3.3      543.1   262.4     19045.2      0.6
  5         -       -          -       -        76.7    61.6     31634.9     87.5
  6         -       -          -       -         2.8     2.6     26827.7    921.7
  7         -       -          -       -          -       -      11726.1   1717.3
  8         -       -          -       -          -       -       2643.2    894.6
  9         -       -          -       -          -       -        305.5    167.4
 10         -       -          -       -          -       -         17.9     12.9
 11         -       -          -       -          -       -          0.5      0.4
 12         -       -          -       -          -       -           -        -



From now on we shall consider large random match-graphs M(n, d), i.e.
we shall assume that $n \to \infty$. If $\pi$ is a property of a graph G on n vertices
then an assertion "G has property $\pi$ almost surely (a.s.)" means that

$$\Pr(G \text{ has property } \pi) \to 1 \quad \text{as } n \to \infty .$$

In what follows the symbols o, O and ~ are used with respect to $n \to \infty$.

It is clear that the orders of complete subgraphs in M(n, d) depend on the
values of the out-degree d. If $d \ge 1$ is a constant then the number of 2-complete
subgraphs asymptotically has a Poisson distribution with expectation $d^2/2$
(see Smit (1979)). Moreover, a random match-graph M(n, d) a.s. does not
contain complete subgraphs of order $r \ge 3$. As a matter of fact, by (2.1)
and (3.1)

$$\Pr\{X_r \ge 1\} \le E(X_r) = \binom{n}{r} \left[ \frac{(d)_{r-1}}{(n-1)_{r-1}} \right]^r = O\left( n^{-r(r-2)} \right) = o(1) .$$



It appears that in order to guarantee the existence of at least one small
complete subgraph in M(n, d), the out-degree d has to be large. Indeed, the
threshold function $d^*(n)$ for the property that M(n, d) contains a complete
subgraph of a fixed order r, $r \ge 3$, is given by (see Jaworski, Palka (1997))
$d^*(n) = n^{(r-2)/(r-1)}$, i.e.,

$$\Pr\{M(n, d) \text{ has an } r\text{-complete subgraph}\} \to \begin{cases} 0, & \text{if } d/d^* \to 0 \\ 1, & \text{if } d/d^* \to \infty . \end{cases}$$

Moreover (see Jaworski, Palka (1997)), for any fixed r, $r \ge 2$, and a constant
c, $0 < c < \infty$, if

lim = c, 

n->oo 

then Xr asymptotically has a Poisson distribution with parameter A = 
. On the other hand, if 



lim dn = uin ) , 

n— vnn ^ 



where $\omega(n) \to \infty$ as $n \to \infty$ but $\omega(n) = o(n^{\delta})$ for every $\delta > 0$ and r is
fixed ($r \ge 2$), then (see Jaworski, Palka (1997)) the standardization of $X_r$
asymptotically has a normal distribution $N(0, 1)$. More precisely



X, - A(n) 



~ A/*(0, 1) , 



where A(n) = 

The formula for the threshold function for the property that a random
match-graph M(n, d) contains r-complete subgraphs ensures that if

$$d = \gamma(n)\, n^{1-\delta} ,$$

where $\gamma(n)$ is a function tending to infinity as $n \to \infty$ and $\delta > 0$ is an
arbitrarily small constant, then M(n, d) contains a.s. a complete subgraph
of order r, where $r = 1 + \lfloor 1/\delta \rfloor$ is an arbitrarily large constant. Clearly, if
$\gamma(n)$ tends to infinity "suitably fast" then one can even expect to find in
M(n, d) a complete subgraph of order r depending on n, i.e. $r = r(n) \to \infty$
("suitably slowly") as $n \to \infty$. Indeed, it can be shown that for $d = O(n^{\alpha})$,
where $\alpha = \alpha(n, r) \to 1$ as $n \to \infty$, a random graph M(n, d) contains a.s. a
complete subgraph of order r = r(n), where 

Furthermore, if $\gamma(n) = c n^{\delta}$, where c is a constant, 0 < c < 1, then M(n, d)
already contains complete subgraphs of order O(log n). As a matter of fact,
we have the following result ($\lfloor x \rfloor$ denotes the largest integer not greater than
x).



Theorem 3.1 Let $L(n, d)$ be the order of the largest complete subgraph in
M(n, d). If d = cn, 0 < c < 1, then for every $\varepsilon > 0$

$$\lfloor r^* - \varepsilon \rfloor \le L(n, d) \le \lfloor r^* + \varepsilon \rfloor \quad \text{a.s.}$$

where

$$r^* = \log_a n - \log_a \log_a n + \log_a e + 1 \qquad (3.2)$$

and $a = 1/c$.
Proof. It is well-known (see e.g. Bollobas (1985)) that the order $L(n, p)$ of
the largest complete subgraph in $G(n, p)$, where p is a constant, satisfies

$$\lfloor t^* - \varepsilon \rfloor \le L(n, p) \le \lfloor t^* + \varepsilon \rfloor$$

a.s., where $\varepsilon > 0$,

$$t^* = 2 \log_b n - 2 \log_b \log_b n + 2 \log_b (e/2) + 1 \qquad (3.3)$$

and $b = 1/p$. The proof of this fact uses the first two moments of the number
of complete subgraphs in $G(n, p)$. Consequently, putting in (3.3)

$$p = \left( \frac{d}{n} \right)^2$$

we arrive at the assertion. □

Although the last result has an asymptotic character, it appears that in some
cases it gives appropriate results for small values of n as well. For example,
if n = 50 then for d = 20, 30 and 40 we have r* = 4.78, 6.63 and 10.18,
respectively, which agrees with the data in Table 1.
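
The quoted values of r* can be reproduced directly from (3.2). The following lines are a small numerical check, not part of the original argument; the function name `r_star` is hypothetical, c is taken as d/n, a = 1/c, and e denotes Euler's number.

```python
import math


def r_star(n, d):
    """Evaluate (3.2): r* = log_a n - log_a log_a n + log_a e + 1 with a = n/d."""
    a = n / d                      # a = 1/c with c = d/n
    log_a_n = math.log(n, a)
    return log_a_n - math.log(log_a_n, a) + math.log(math.e, a) + 1


values = [round(r_star(50, d), 2) for d in (20, 30, 40)]
# values == [4.78, 6.63, 10.18]
```

The same function can be used to see how slowly the order of the largest complete subgraph grows with n for a fixed ratio d/n.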

Now let us determine possible orders of cliques in a random match-graph 
M{n,d). Unfortunately we cannot directly apply known results about 
cliques for a classical random graph G{n,p), since the property that a graph 
has a clique is a global property. Nevertheless we can prove the following 
result. 

Theorem 3.2 Let d = cn, where c is a constant, 0 < c < 1 and $r^*$ is given
by (3.2). Assume that $\varepsilon > 0$ is fixed and $\omega(n)$ is a function tending to infinity
arbitrarily slowly as $n \to \infty$. Then for any integer r such that

$$\lfloor \tfrac{1}{2} \log_{1/c} n + \omega(n) \rfloor \le r \le \lfloor r^* - \varepsilon \rfloor \qquad (3.4)$$

a random match-graph M(n, d) contains an r-clique a.s.

Proof. The upper bound on r in (3.4) follows from Theorem 3.1, since
obviously the largest complete subgraph forms a clique. In order to prove
our result, we use a method described in Palka et al. (1986). It is known
(see Bollobas, Erdos (1976)) that for t such that $2 \le t \le \lfloor t^* - \varepsilon \rfloor$ and a
constant p

$$\frac{\mathrm{Var}(W_t)}{E(W_t)^2} = \frac{E(W_t^2)}{E(W_t)^2} - 1 = o(1) ,$$

where $W_t$ is the number of t-complete subgraphs in $G(n, p)$ and $t^*$ is given
by (3.3). Thus, by Chebyshev's inequality, we have

$$\Pr\{X_r \ge .9\, E(X_r)\} = 1 - o(1) . \qquad (3.5)$$

Now, if $k \ge 0$ is fixed, then (2.4) implies that

$$S_k \sim [\Lambda(n, r)]^k / k! ,$$

where

$$\Lambda(n, r) = n [d/n]^{2r} = n c^{2r} .$$

Also $\Lambda(n, r) \to 0$, for any

$$r \ge \tfrac{1}{2} \log_{1/c} n + \omega(n) .$$

Consequently, by (2.3),

$$E(Y_r) = E(X_r)(1 - o(1)) .$$

Thus by Markov's inequality

$$\Pr\{X_r - Y_r \le .1\, E(X_r)\} = 1 - o(1) .$$

Combining this and (3.5) we deduce that

$$\Pr\{Y_r \ge .8\, E(X_r)\} = 1 - o(1) ,$$

which implies that

$$\Pr\{Y_r > 0\} = 1 - o(1) ,$$

for all r satisfying (3.4). This completes the proof. □

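Next let us examine the behavior of the expected value $E(X_r)$ in the case
when d is very close to n. We have the following result.



Theorem 3.3 Let $d = n - 1 - k$ where $k \ge 1$ is a constant. Let 0 < c < 1
be fixed and put

$$f(k, c) = \frac{(1-c)^{kc}}{c^c (1-c)^{1-c}} .$$

Then

$$\lim_{n \to \infty} E(X_{cn}) = \begin{cases} 0, & \text{if } f(k, c) < 1 \\ \infty, & \text{if } f(k, c) > 1. \end{cases}$$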



Proof. If $d = n - 1 - k$ and $r = cn$ then

$$E(X_r) = \binom{n}{r} \left[ \frac{(n-r)_k}{(n-1)_k} \right]^r . \qquad (3.6)$$

By Stirling's formula

$$\binom{n}{cn} \sim \frac{\left[ c^c (1-c)^{1-c} \right]^{-n}}{\sqrt{2\pi c(1-c)n}}$$

and simple calculations show that

$$\left[ \frac{(n-cn)_k}{(n-1)_k} \right]^{cn} \sim (1-c)^{kcn} \exp\{kc(1 - (kc-1)/(1-c))/2\} .$$

Consequently, by (3.6) we have

$$E(X_{cn}) \sim [f(k, c)]^n \, \frac{\exp\{kc(1 - (kc-1)/(1-c))/2\}}{\sqrt{2\pi c(1-c)n}} ,$$

which completes the proof. □
Abstract: In many fields of signal processing feed-forward neural networks,
especially multilayer perceptron neural networks, are used as approximators. We
propose to formulate the weight adaptation process (training) as an optimization
procedure solving a conventional nonlinear regression problem. Thus the presented
theory can easily be adapted to any similar problem.

Recently it has been shown that second order methods yield a fast decrease
of the training error for small and medium scaled neural networks. Especially
Marquardt’s algorithm is well known for its simplicity and high robustness.

In this paper we show an innovative approach to minimize the training error. We
will demonstrate that an extension of Marquardt’s algorithm, i.e. the adaptation
of the increasing/decreasing factor, leads to much better convergence properties
than the original formula. Simulation results illustrate excellent robustness
concerning the initial values of the weights and lower overall computational costs.



1 Introduction 

Multilayered feed-forward neural networks are employed in an increasing
number of applications since they offer an easy and universal approach to
implement an input-output mapping based on a given set of samples. How- 
ever, the problem of training the network is a challenging task. It can be 
structured as an unconstrained nonlinear optimization problem. In the past 
forty years the subject of numerical techniques for optimization has been 
extensively researched and highly developed. 

For small and medium scaled neural networks it has been found com- 
putationally efficient to apply second order methods. Since training of 
feed-forward neural networks leads to a nonlinear least squares optimiza- 
tion problem it is advantageous to use only an approximation of the Hes- 
sian matrix. By this means the calculation of second derivatives can be 
avoided. The characteristics of the training problem necessitate Levenberg’s
regularization (Levenberg (1944)). A robust algorithm for the adaptation
of the Levenberg parameter that is easy to implement was found by
Marquardt (1963). Lately such second order methods have been increasingly
employed for training neural networks (e.g. see Hagan, Menhaj (1994) and
Demuth, Beale (1994)).

In this paper we first state the optimization problem and show some of its
characteristics in order to motivate the selection of second order
training methods. Then we describe Marquardt’s algorithm and state our
novel extension to it. Experimental results conclude the paper.

2 Supervised Training: A Nonlinear Least 
Squares Problem 

Let us assume that we want to find an approximation to a function $y(x)$, $x \in \mathbb{R}^{M_0}$
and $y \in \mathbb{R}^{M_L}$, using a multilayer perceptron (MLP) neural network (e.g.
Cichocki, Unbehauen (1993)). $M_0$ and $M_L$ thus determine the numbers of neurons in
the input and output layer, respectively. Here we choose
only one hidden layer ($L = 2$). All layers except the output layer get an
additional offset neuron with a constant activation of one. So we obtain
an output signal $y \in \mathbb{R}^{M_L}$

y{x,w) = , 

where the weight or parameter vector w includes all the entries of the weight 
matrices Xe and denote the extended versions of x 

and t/? defined by 




Now we want to approximate an unknown underlying function $y(x)$. All we
have is a set $S$ of $P$ (possibly noisy) samples of $y(x)$

$$S = \{(x^{(p)}, y^{(p)}) \mid p = 1, \dots, P\}$$

with $x^{(p)}$ an input vector and $y^{(p)}$ the desired output vector that
corresponds to $x^{(p)}$. Let us call $w = w^*$ the optimal parameter vector if $w^*$
minimizes the error or objective function

$$E(w, S) := \frac{1}{2} \sum_{p=1}^{P} \left\| y^{(p)} - y(x^{(p)}, w) \right\|^2. \qquad (1)$$

The notation is simplified by defining the residual vectors

$$r^{(p)} := y^{(p)} - y(x^{(p)}, w)$$

and

$$r(w, S) := \left[ r^{(1)T}, \dots, r^{(P)T} \right]^T.$$
Then (1) results in

$$E(w, S) = \frac{1}{2}\, r^T(w, S)\, r(w, S) \qquad (2)$$

and we obtain the unconstrained optimization problem

$$\text{find } w^* \;\big|\; E(w^*) = \min_w E(w). \qquad (3)$$
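
The least squares objective of this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `mlp_output`, `residual_vector`, `error`, `W1` and `W2` are hypothetical, `tanh` stands in for the unspecified sigmoidal nonlinearity, and the factor 1/2 with the squared Euclidean norm is assumed for the error function.

```python
import numpy as np

def mlp_output(x, W1, W2):
    """One-hidden-layer MLP with offset neurons and a linear output layer."""
    x_e = np.concatenate(([1.0], x))   # input extended by the offset neuron
    h = np.tanh(W1 @ x_e)              # sigmoidal hidden activations (tanh is an assumption)
    h_e = np.concatenate(([1.0], h))   # hidden layer extended by the offset neuron
    return W2 @ h_e                    # linear output layer

def residual_vector(X, Y, W1, W2):
    """Stack the per-sample residuals y_p - y(x_p, w) into one long vector r."""
    return np.concatenate([y - mlp_output(x, W1, W2) for x, y in zip(X, Y)])

def error(X, Y, W1, W2):
    """Least squares objective E = 1/2 * r^T r."""
    r = residual_vector(X, Y, W1, W2)
    return 0.5 * float(r @ r)
```

The parameter vector w of the text corresponds to all entries of `W1` and `W2` taken together; a training method only needs `residual_vector` and its derivatives with respect to these entries.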



The whole battery of derivative-free, gradient, conjugate gradient, secant 
and Newton-type methods can be applied (e.g. Dennis, Schnabel (1996)). 
This leads to many first and second order training algorithms (cf. 
Battiti (1992) and Hagan, Menhaj (1994)). 

In order to select an appropriate method we have to discuss some characteris- 
tics of the unconstrained optimization problem (3). Owens and Filkin (1989) 
found that (3) is ill-conditioned for a wide range of w when using a MLP 
structure. Therefore gradient based methods show slow progress in reducing 
the error after a while. To guarantee a high convergence rate, the standard
literature in numerical mathematics (e.g. Dennis, Schnabel (1996)) suggests
involving second order information and employing a Newton-Raphson,
Gauss-Newton, secant or at least a conjugate gradient method.

Since we consider a least squares problem (3), it is expedient to apply
the Levenberg modification (Levenberg (1944)) of the Gauss-Newton
formula

$$w[n+1] = w[n] - \left[ J^T(w[n])\, J(w[n]) + \lambda[n]\, I \right]^{-1} g(w[n]), \qquad (4)$$

with the gradient vector $g(w) := \left[\frac{\partial E}{\partial w}(w)\right]^T$, the Levenberg parameter $\lambda$
and the identity matrix $I$, because in (4) we only have to evaluate the first
order derivatives, i.e. the Jacobian $J(w) := \frac{\partial r}{\partial w}(w)$. Furthermore we are able
to guarantee a positive definite approximation $J^T(w)J(w) + \lambda I$ of the Hessian
matrix for $\lambda > 0$. This enables us to apply the computationally efficient
Cholesky decomposition for solving (4).

A main problem that remains is to find an appropriate adaptation of $\lambda[n]$.



3 The Marquardt Algorithm and its Novel 
Extension 

There are several algorithms to adapt $\lambda$. Some are motivated by the trust
region approach (cf. Dennis, Schnabel (1996)). In our experience the Marquardt
algorithm (Marquardt (1963)) works even better in the context of
training MLPs. The proposed adaptation of the Levenberg parameter $\lambda$
reads as follows:
M.1 Choose an initial $\lambda[0]$ and a value for the increasing/decreasing factor
$\beta > 1$ (e.g. 0.01 and 10, respectively)

M.2 Denote the initial error as $E[0]$

M.3 Repeat the following steps until termination

M.3.1 Find the first element $m$ out of the (ordered) sequence $-1, 0, 1, \dots$
so that $E(\beta^m \lambda[n]) < E[n]$ is satisfied ¹ ²

M.3.2 $\lambda[n+1] := \beta^m \lambda[n]$. Denote (the just evaluated) $E(\beta^m \lambda[n])$ as
$E[n+1]$

It is important to note that if the first trial using $m = -1$ fails, the system of
equations (4) has to be solved (at least) once more. This is in particular
the case in the later phase of the optimization process. (Fig. 1 shows a
typical graph.) Therefore it is computationally very attractive to adjust $\beta$
in an appropriate manner so that the first trial is accepted in most of the
optimization steps and no so-called backtracking steps are necessary.

A straightforward strategy is to decrease $\beta$ when exactly one backtracking
step ($m = 0$) is used and to increase $\beta$ in the case of none or more than one.
For stability reasons we have to guarantee $\beta > 1$. So we can formulate the
“$\beta$-adaptive” version of Marquardt’s algorithm as follows (the italic letters
indicate changes to the original version):

A.1 Choose the initial values $\lambda[0]$ and $\beta[0]$ (cf. M.1)

A.2 Choose the additional (fixed) increasing/decreasing factors $\delta_0$, $\delta_1$ and
$\delta_n$ with $\delta_1 < 1$ and $\delta_0, \delta_n > 1$ (e.g. $\delta_0 := 1.1$, $\delta_1 := 0.8$, $\delta_n := 1.2$)

A.3 Choose a lower bound $b > 1$ for $\beta$ (e.g. $b := 1.1$)

A.4 Denote the initial error as $E[0]$

A.5 Repeat the following steps until termination

A.5.1 Find an appropriate step and set $\lambda[n+1]$ and $E[n+1]$ (according
to step M.3.1)

A.5.2
$$\beta[n+1] := \begin{cases} 1 + (\beta[n] - 1)\,\delta_0 & : m = -1 \\ 1 + (\beta[n] - 1)\,\delta_1 & : m = 0 \\ 1 + (\beta[n] - 1)\,\delta_n & : m > 0 \end{cases}$$

A.5.3 if $\beta[n+1] < b$ then $\beta[n+1] = b$





¹ Here we consider $E$ as $E(\lambda) = E(w(\lambda))$, where $w(\lambda)$ is the solution of (4) for a
specific $\lambda$.

² The convergence of this iterative process is guaranteed for this training task since for
a sufficiently large $\lambda$ a small step in the gradient descent direction is made.
This control technique for $\beta$ leads to a much higher acceptance of the first
trial (M.3.1: $m = -1$) of Marquardt’s formula, especially when the
optimization procedure moves into regions with slow progress in decreasing the
objective function.
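
The adaptive scheme of steps A.1 to A.5.3 might be sketched as follows. This is a minimal illustration, not the authors' implementation: all function and parameter names are hypothetical, the cap `max_m` on the number of backtracking trials and the fixed number of steps are assumptions, and a two-parameter exponential fit stands in for the MLP training problem.

```python
import numpy as np

def adaptive_marquardt(residual, jacobian, w0, lam0=0.01, beta0=3.0,
                       d0=1.1, d1=0.8, dn=1.2, b=1.1, steps=100, max_m=50):
    """Levenberg-Marquardt iteration with an adaptive factor beta (steps A.1-A.5.3)."""
    w = np.asarray(w0, dtype=float)
    lam, beta = lam0, beta0
    r = residual(w)
    err = 0.5 * float(r @ r)
    errors = [err]
    for _ in range(steps):
        J = jacobian(w)                  # J = dr/dw
        g = J.T @ r                      # gradient of E = 1/2 r^T r
        H = J.T @ J                      # Gauss-Newton approximation of the Hessian
        accepted = False
        for m in range(-1, max_m):       # m = -1, 0, 1, ... (the cap is an assumption)
            trial_lam = beta ** m * lam
            L = np.linalg.cholesky(H + trial_lam * np.eye(w.size))
            step = np.linalg.solve(L.T, np.linalg.solve(L, g))
            w_new = w - step
            r_new = residual(w_new)
            err_new = 0.5 * float(r_new @ r_new)
            if err_new < err:            # first m that reduces the error is accepted
                accepted = True
                break
        if not accepted:                 # no improving step found: stop early
            break
        w, r, err, lam = w_new, r_new, err_new, trial_lam
        delta = d0 if m == -1 else d1 if m == 0 else dn
        beta = max(b, 1.0 + (beta - 1.0) * delta)   # A.5.2 followed by A.5.3
        errors.append(err)
    return w, errors

def exp_residual(w, x=np.linspace(0.0, 2.0, 20)):
    """Residuals y - a*exp(b*x) for noise-free data generated with a=2, b=-1.5."""
    y = 2.0 * np.exp(-1.5 * x)
    return y - w[0] * np.exp(w[1] * x)

def exp_jacobian(w, x=np.linspace(0.0, 2.0, 20)):
    """Jacobian dr/dw of exp_residual."""
    e = np.exp(w[1] * x)
    return np.column_stack((-e, -w[0] * x * e))
```

Calling `adaptive_marquardt(exp_residual, exp_jacobian, [1.0, -1.0])` returns the fitted parameters and the sequence of accepted errors. Setting `d0 = d1 = dn = 1.0` keeps the factor fixed and so gives the original scheme M.1 to M.3.2 for comparison.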



4 Experimental Results 

In order to get an idea of the performance of the novel “$\beta$-adaptive” Marquardt
algorithm used for training MLPs we consider two examples. The
first one is taken from Hagan, Menhaj (1994), where the superiority of the 
Gauss-Newton method (applying Marquardt’s algorithm) over backprop- 
agation with momentum and a conjugate gradient technique is outlined. 
Therefore we focus on a comparison between the original and the adaptive 
Marquardt algorithm. The second example deals with a real-world applica- 
tion: the approximation of the inverse of the family of characteristics of a 
turbidity sensor. 

For all simulation examples we apply an MLP with one hidden layer. We 
use sigmoidal nonlinearities in the hidden layer and a linear output layer. 
The input and the hidden layers contain an additional offset neuron. The 
initial weights generated are uniformly distributed in [—2, 2] and normalized 
with the algorithm proposed by Nguyen and Widrow. A weight adaptation 
after presenting all patterns (“batch-mode”) is performed. Furthermore we 
work with an initial Levenberg parameter $\lambda[0] = 0.01$ and an (initial)
increase/decrease factor $\beta[0] = 3$ for the (adaptive) Marquardt formula.



4.1 Problem 1: 2D-sin(x)/x-function 

For the first problem, the approximation of the two dimensional function

$$y(x_1, x_2) = \frac{\sin\left(\pi \sqrt{x_1^2 + x_2^2}\right)}{\pi \sqrt{x_1^2 + x_2^2}}$$

we use a 2:15:1 MLP. The training set consists of 289 pattern vectors
equally scattered on a grid $x_1, x_2 \in [-2.5, 2.5]$. The
tested parameter combinations are shown in Tab. 1 and $b$ was fixed to 1.1.
All combinations are tested with (the same) 50 (different) initial weight sets.
The training is terminated after the 100th batch step.³ Fig. 1 shows the
typical progress of the training error subject to (3) versus the number of
batch steps for the various parameter combinations whereas in Fig. 2 the
error is plotted versus the number of used floating point operations. The
final number of operations is given in the upper half of Fig. 2. It is worth
noting that in this case the runs applying the novel algorithm (curves b-g)
need much less computational effort than the original (curve a, solid line).
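
A training set of this kind could be generated as sketched below. The 17 x 17 layout is an assumption inferred from the 289 pattern vectors, and the function name is hypothetical; `numpy.sinc(r)` evaluates sin(πr)/(πr) and returns 1 at r = 0.

```python
import numpy as np

def sinc2d_training_set(points_per_axis=17, lo=-2.5, hi=2.5):
    """Regular grid on [lo, hi]^2 with targets sin(pi*r)/(pi*r), r = sqrt(x1^2 + x2^2)."""
    axis = np.linspace(lo, hi, points_per_axis)
    x1, x2 = np.meshgrid(axis, axis)
    X = np.column_stack((x1.ravel(), x2.ravel()))   # one pattern vector per row
    radius = np.hypot(X[:, 0], X[:, 1])
    y = np.sinc(radius)                             # normalized sinc of the radius
    return X, y
```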



³ Using this termination criterion we get around the problem of categorizing tests that
Table 2 presents the means $\mu$ and standard deviations $\sigma$ of the final error
and the final number of floating point operations.



case   final flops             final error
       $\mu$       $\sigma$    $\mu$   $\sigma$
a      3.40 10®    1.18 10®    53.6    3.42
b      3.03 10®    2.07 10^    53.0    4.14
c      3.09 10®    1.81 10^    52.3    4.36
d      2.78 10®    2.20 10^    52.8    4.36
e      3.07 10®    2.86 10^    52.1    4.33
f      3.05 10®    2.77 10^    50.4    5.61
g      3.06 10®    2.75 10^    51.3    4.97

Table 2: Mean $\mu$ and standard deviation $\sigma$ over 50 runs

We may conclude that the novel approach reduces the computational costs
by up to 18% while reaching at least the same error goal. The values of the
parameters $\delta_0$, $\delta_1$ and $\delta_n$ can be chosen in a wide range (within specific
limits).

4.2 Problem 2: Turbidity Sensor 

Let us consider a sensor that outputs three voltages $x_1, x_2, x_3$ that express
the current turbidity $t$ (for details see Lendl et al. (1998)). In a sequence of
tests we find a set $S$ of $P = 51$ training patterns.

Now we seek an approximation $\hat{t}(x, S)$ to the inverse sensor system. For
this simulation a 3:3:1 MLP and the same parameter combinations as in
Problem 1 are used. Table 3 shows the statistics for 100 runs.



case   final flops             final error
       $\mu$       $\sigma$    $\mu$        $\sigma$
a      7.89 10®    9.96 10^    7.86 10-®    2.83 10-2
b      6.90 10®    6.23 10®    7.82 10-®    2.85 10-2
c      6.99 10®    6.84 10®    7.78 10-®    2.84 10-2
d      6.23 10®    5.95 10®    7.80 10-®    3.03 10-2
e      7.33 10®    6.50 10®    7.86 10-®    2.84 10-2
f      7.52 10®    7.56 10®    7.85 10-®    2.82 10-2
g      6.72 10®    7.54 10®    7.80 10-®    2.83 10-2

Table 3: Mean $\mu$ and standard deviation $\sigma$ over 100 runs

This example also demonstrates the superiority of the adaptive version over
the original one. It results in a reduction of up to 20% in numerical expense.



do not reach a pre-specified error bound. 
5 Conclusions

Since multilayer perceptron (MLP) neural networks are extremely well suited
for approximation tasks, they are increasingly used for signal processing
tasks. One important problem, finding the optimal weights, remains.

In this paper the training process for an MLP is stated as an ill-conditioned 
least squares optimization problem. Taking these characteristics into ac- 
count the Levenberg modification of the Gauss-Newton method appears 
favourable to solve this task. In fact Hagan and Menhaj (1994) showed 
its superiority in comparison with other training algorithms. 

A further improvement can be reached by applying a novel adaptation 
technique for the Levenberg parameter. Whereas Marquardt’s proposition 
(Marquardt (1963)), as applied in Hagan, Menhaj (1994), works with a fixed 
increasing/decreasing factor we demonstrated that an appropriate change of 
this factor leads to a significant decrease of the overall computational costs. 
The additional effort for programming is negligible. Experimental results 
are given. 
Abstract: Formal Concept Analysis is a mathematical method of qualitative
data analysis. The present experiment was aimed at finding out if and how far 
graphical representation tools of Formal Concept Analysis can be used to support 
choice decisions. This new approach is mainly descriptive and intends both to 
increase the large number of practical applications of Formal Concept Analysis and 
to enrich the cognitive-psychological field of Decision Analysis by mathematical- 
algebraic aspects. 



1 Theory 

1.1 Formal Concept Analysis 

Formal Concept Analysis (Ganter, Wille (1996)) is a mathematical method 
for the formal description and graphical representation of data. It is based 
on the theory of ordered sets, and especially of complete lattices. Formal 
Concept Analysis starts from the notion of a formal context, which consists 
of a set of objects, a set of attributes and a relation between these two sets; 
i. e. certain objects have certain attributes. Such contexts can be visualized 
by incidence tables, which will be called cross tables here. 

A selection of certain objects, together with the attributes shared by all these 
objects, constitutes a formal concept if there is no other object comprising 
all these attributes. The concept lattice of the context is the set of all 
these concepts, structured by the subconcept-superconcept hierarchy. (The 
subconcept covers only part of the objects of the superconcept; vice versa 
the superconcept contains only part of the attributes of the subconcept.) 
Concept lattices can be graphically represented by line diagrams. Each 
concept is visualized by a node in the plane in such a way that the node
of each subconcept S of a concept C is below the node of C (directly or
diagonally). If there is no other concept between S and C, both nodes are
connected by a line. The concepts induced by single objects are labelled 
with the object names, the attribute concepts with the attribute names, 
respectively. The context can always be reconstructed from the line diagram 
because an object o has an attribute a if and only if there is an ascending 
path leading from the node labelled o to the node labelled a. 
1.2 Line Diagrams for Decision Support

For static well-defined choice decisions among several alternatives and with 
respect to several attributes, all information relevant to the decision-maker 
can be concentrated in an alternative-attribute-matrix. The row belonging 
to a certain alternative contains the features of all attributes with respect 
to this alternative. If every attribute is dichotomous (i. e. an alternative
can either have or not have this attribute), the alternative-attribute-matrix
immediately becomes a cross table. Since cross tables can always be
interpreted as formal contexts, Formal Concept Analysis is applicable here. The
equivalence of contexts and concept lattices and the graphical advantages of 
line diagrams have led to the idea to use line diagrams for decision support. 



2 Questions and Hypotheses 

The present experiment was intended to yield first experience concerning the 
suitability of Formal Concept Analysis diagrams for the practical purpose of 
decision support. An essential problem has been the possible susceptibility 
of decision making to variations of data representation. 

Since the suitability of Formal Concept Analysis diagrams as a decision aid 
had not been tested so far, the main question was: 

Can line diagrams of concept lattices be used for decision support? 

Linked to that are many more questions: 

1. How quickly can novices in F. C. A. learn to read line diagrams? 

2. How competent in dealing with line diagrams do the subjects become? 

3. Which is the tool of preference for representing the information rele- 
vant to the decision: line diagrams or cross tables? 

4. Can different presentations of the same information lead to different 
decisions? 

Connected to this are the following subquestions: 

- Are the alternatives in the lower part of the line diagram intu- 
itively believed to have more attributes than they really have? 

- Can decisions be manipulated in favour of certain alternatives by 
arranging them in the lower part of the line diagram? 

- Which role do the preferences of the decision-maker play? 

From these questions, the following hypotheses have been derived: 

H1: Because of their graphical representation, line diagrams are at least
equivalent to cross tables as tools of information presentation. 

H2: Inexperienced people at first have difficulties in reading line diagrams. 
H3: The way of representation (the distortion) of a line diagram influences
the results of decisions made on the basis of this line diagram.

H4: An alternative is preferred by the decision-maker if it appears optically 
in the lowest part of the line diagram. 

H5: Decision time decreases from decision to decision. 

H6: The recognition of the formal symmetry in the decision tasks is in- 
versely related to the distortion of the line diagrams. 

H7: Symmetry is increasingly better recognized in the course of the exper- 
iment (learning effect). 

3 Methods 

3.1 Sample and Decision Tasks 

The subjects were 36 undergraduates with an average age of 21.5 years. None
of them had previous knowledge of Formal Concept Analysis.

The subjects had to solve five decision tasks. Part 1 consisted of the tasks
“Computer”, “School way” and “Swimming pool” while the tasks “Applicant”
and “TV set” formed Part 2.





               Attribute
               1   2   3   4
Alternative 1      X   X   X
Alternative 2  X       X   X
Alternative 3  X   X       X
Alternative 4  X   X   X

Table 1. Common structure of the decision tasks in Part 1

In Part 1 the subjects were given three optically different line diagrams of 
the same data set: Each alternative has exactly three of the four attributes, 
with a different attribute missing in all four cases. The formal context in 
table 1 shows this formal symmetry. 
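
To make the step from the cross table to its concept lattice concrete, the following sketch enumerates all formal concepts of the context in table 1 by brute force: it closes every subset of alternatives. The names are hypothetical, and the exhaustive enumeration is chosen only for clarity on this small context; it is not the method used in the experiment.

```python
from itertools import combinations

# Cross table of Table 1: alternative i has every attribute except attribute i.
CONTEXT = {
    "Alternative 1": {2, 3, 4},
    "Alternative 2": {1, 3, 4},
    "Alternative 3": {1, 2, 4},
    "Alternative 4": {1, 2, 3},
}
ATTRIBUTES = {1, 2, 3, 4}

def common_attributes(objects, context=CONTEXT, attributes=ATTRIBUTES):
    """Attributes shared by all given objects (all attributes for the empty set)."""
    shared = set(attributes)
    for obj in objects:
        shared &= context[obj]
    return frozenset(shared)

def objects_having(attrs, context=CONTEXT):
    """Objects that have every attribute in attrs."""
    return frozenset(o for o, has in context.items() if attrs <= has)

def formal_concepts(context=CONTEXT, attributes=ATTRIBUTES):
    """Enumerate all (extent, intent) pairs by closing every subset of objects."""
    concepts = set()
    names = list(context)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            intent = common_attributes(subset, context, attributes)
            extent = objects_having(intent, context)
            concepts.add((extent, intent))
    return concepts
```

For this symmetric context every subset of attributes occurs as an intent, so `formal_concepts()` yields 16 concepts. Ordering them by inclusion of their extents gives the subconcept-superconcept hierarchy that a line diagram draws.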

To control a possible learning effect, the subjects were additionally balanced 
across the order and the representation of the decision tasks ( “Computer” - 
choose the ecologically best model, “School way” - choose the safest path, 
“Swimming pool” - choose the pool with the most opening days). Each 
subject went through three different decision tasks in three differently
represented diagrams. So a large number (N = 36 · 3 = 108) of decisions
concerning the same task structure could be jointly evaluated, which increased the
power of the test. With a view to clearly attributing differences in decision 
behaviour to differences in data representation, the task-specific influence 
of the decision-makers’ individual preferences had to be largely eliminated. 
This was achieved by means of two things: first, a decision task with a clear
aim (as objective as possible) and, second, structural symmetry including 
the explicit statement that certain attributes have to be regarded as equiv- 
alent. It was noted whether the subjects recognized this symmetry. Fig. 1 
shows line diagrams for two of the decision tasks in different distortions. 




Figure 1: Decision tasks of Part 1: left - “Computer”, symmetric represen- 
tation, right - “Swimming pool”, distorted to the right 

Contrary to Part 1, in Part 2 the individual preferences of the subjects 
could influence the decisions. Here, the preferences were covariables and 
ascertained in a questionnaire. The procedure of attributing possible differ- 
ences in the decision result (here consisting in a hierarchy of all alternatives) 
solely to the varying representation of data is justified by the assumption 
that the preferences in the different groups are equally distributed. Here, 
too, the sample was randomized over the distortions of the line diagram. 

3.2 Course of the Experiment 

After a short introductory questionnaire, the subjects were given the written 
instruction “How to read line diagrams”. Then they were shown a partic- 
ular line diagram and asked five questions that could only be answered by 
correctly reading the information from the diagram. When the answers were 
wrong, they were given further, oral, instruction and five more questions. 
For each of the five decision tasks, the subjects received work sheets with 
short texts describing the tasks, and the corresponding line diagrams. 
Finally, in a questionnaire, the subjects were asked to estimate the suitability 
of line diagrams for the clear presentation of information, absolutely and 
compared to cross tables, and to give reasons for their opinion. 
3.3 Design of the Experiment

The questions and hypotheses required the following variables. 

• Independent variables: 

- Decision task (in Part 1 three different tasks, in Part 2 two) 

- Way of distortion (in Part 1 three distortions, in Part 2 two in each 
task) 

Part 1                  symmetric      left distorted   right distorted
Task A: Computer        12 decisions   12 decisions     12 decisions
Task B: School way      12 decisions   12 decisions     12 decisions
Task C: Swimming pool   12 decisions   12 decisions     12 decisions

Part 2                  Distortion 1   Distortion 2
Task “Applicant”        18 decisions   18 decisions
Task “TV set”           18 decisions   18 decisions

• Dependent variables: 

- Decision result (preferred alternative in Part 1, full alternative ranking 
in Part 2) - hypotheses 1, 3 and 4 

- Decision time (time needed to reach a decision) - hypotheses 1, 2 and 5 

- Recognition of symmetry (only in Part 1) - hypotheses 2, 3, 6, 7

• Controlled variable: 

- Position of the task in the course of the experiment (in Part 1 positions 
1, 3, 5, in Part 2 positions 2 and 4) 

4 Results and Discussion 

4.1 Part 1 of the Experiment 

4.1.1 Recognition of Symmetry 

The formal symmetry of the tasks of Part 1 was recognized by half the
subjects on average (cf. table 2). Pearson’s chi-square test (Lienert (1962))
finds no significant differences between the tasks ($\chi^2_{df=2} = 2.3$; $p > 0.05$).

A check of the hypothetical connection between distortion and recognition
of symmetry yielded no significant differences either ($\chi^2_{df=2} = 2.3$; $p > 0.05$;
cf. table 3). For half the decisions with (left or right) distorted representation
the symmetry was recognized, whereas the rate of recognition in the
symmetric case (56%) was slightly but not significantly higher.

A connection between task position and recognition of symmetry could not
be established either ($\chi^2_{df=2} = 2.7$; $p > 0.05$).
                 Decision task
Symmetry         Computer   School way   Swimming pool
not recognized   21         16           15
recognized       15         20           21

Table 2: Frequencies of recognition of symmetry with respect to the task





                 Way of distortion
Symmetry         symmetric   left distorted   right distorted
not recognized   16          15               21
recognized       20          21               15

Table 3: Frequencies of recognition of symmetry with respect to distortion
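
The chi-square values reported for tables 2 and 3 can be recomputed from the frequencies. The sketch below uses only the standard library; the function names are hypothetical, and the closed form exp(-x/2) for the upper tail holds only for two degrees of freedom.

```python
import math

def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / total   # count under independence
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

def p_value_df2(stat):
    """Upper-tail probability of the chi-square distribution with exactly 2 degrees of freedom."""
    return math.exp(-stat / 2.0)

TABLE_2 = [[21, 16, 15],   # symmetry not recognized: Computer, School way, Swimming pool
           [15, 20, 21]]   # symmetry recognized
```

`chi_square(TABLE_2)` gives a statistic that rounds to 2.3 with two degrees of freedom, in line with the reported value. Table 3 contains the same frequencies in a different column order and therefore gives the same statistic.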



4.1.2 Decision Time 

With regard to decision time, the tasks of Part 1 did not differ ($\chi^2 = 0.78$;
$p > 0.05$). That bears out the formal equality of task difficulty.

The average decision time decreased from 4.28 min (pos. 1) via 3.33 min
(pos. 2) to 3.06 min (pos. 3); an analysis of variance confirms the significance
of this result ($F_{df=2} = 6.6$; $p < 0.01$). This expected learning effect was
controlled by balancing both independent variables across all positions. A
connection between decision time and distortion of the line diagrams was
not established ($U = 1238.0$; $p > 0.05$).



4.1.3 Decision Result 

The influence of the decision-makers’ subjective preferences was so large that 
an evaluation of the decisions referring to the independent variables was not 
feasible. This is illustrated for the task “Computer” (cf. table 4). 



chosen alternative(s)   A   B    C   D   B, C   B, D   A, B, C, D
number of subjects      3   12   7   8   1      1      4

Table 4: Decision result for task “Computer” (N=36)



The fact that only four subjects considered all alternatives to be equivalent, 
though 15 had recognized the formal symmetry (and implicit equivalency) 
of the alternatives, is evidence of the dominance of subjective preferences 
as decision criterion. In the present case, alternative B was often preferred 
because it lacks the attribute “Stand-by function”, which many subjects 
regarded as not at all ecological. Though the alternatives were constructed 
to be formally equal, the decision-makers had enough possibilities to make 
the decisions according to their own subjective criteria (cf. table 5). 
alt.   W   R   M   I   W, R   W, I   R, M   M, I   W, M, I   W, R, M, I
N      2   4   5   7   2      1      1      1      1         12

Table 5: Decision result for task “Swimming pool” (N=36)



4.2 Part 2 of the Experiment 

The line diagrams in the task “Applicant” of Part 2 were distorted in such a
way that applicants A and C are in the lower part of the line diagram with
distortion 1, while B and C are in the lower part with distortion 2. On
the basis of the hypotheses it might have been expected that applicant A
would perform significantly better and applicant B significantly worse with
distortion 1 than with distortion 2. This was not the case. With distortion 1,
applicant A was 3.5 times at rank 1, with distortion 2 twice. Applicant B
was 14.5 times at rank 1 with distortion 1 and 15 times with distortion 2.
Possibly because of their “cubic form”, the line diagrams were too obvious
to influence the decision-maker optically.

In the task “TV set”, for the more specific hypothesis that the TV sets 4 and
5 would do better with distortion 2 than with distortion 1, no significance
could be observed either; but table 6 shows a slight tendency.



Distortion   TV set   Rank 1    Rank 2    Rank 3      Rank 4      Rank 5
1            4                  4 times   5 times     8 times     1 time
             5                            1 time                  17 times
2            4        2 times   5 times   4.5 times   4.5 times   2 times
             5        1 time              3 times     7 times     7 times

Table 6: Frequencies of rankings for TV sets 4 and 5



4.3 Evaluation of the Questionnaire 

On a rating scale from 0 to 10, the subjects rated the suitability of line
diagrams for a clear knowledge representation at 5.6 on average (standard
deviation 2.3), with ratings ranging from 1.3 to 9.1. The fact that 10
subjects rated below 3.5 and 16 subjects 6.5 or higher clearly divides the
sample into two groups. Seven subjects preferred line diagrams, 21 gave
preference to cross tables, and eight rated them equally suitable.

Of interest are the sex-specific differences. Five of the 13 male subjects opted
for the line diagrams. The female subjects were more moderate (all eight
indifferent judgements came from females) and also more conservative (13
preferred cross tables and only two line diagrams).
4.4 Answers to the Questions

The equivalency of line diagrams and cross tables as tools of knowledge
representation (H1) cannot be derived from our data.

The hypothesis of the starting problems in dealing with line diagrams (H2) 
was absolutely confirmed. Here, too, variance is very large. Some subjects 
understand the theory of Formal Concept Analysis at once, others need some 
training, and some will probably prefer cross tables forever. A connection 
between distortion of the line diagram and decision result (H3) was observed, 
but not statistically proven. The results concerning possible manipulability 
of decisions (H4) are similar. Tendencies were found, but the hypothesis was 
not confirmed. The majority of subjects is not deceived by optical aspects 
and reads the relevant information correctly from the line diagrams. The 
assumption of a reduction of decision time in the course of the experiment 
(H5) is confirmed, so there is a training effect in reading line diagrams. The 
hypotheses about better recognition of symmetry in dependence of distortion 
(H6) and position (H7) cannot be confirmed from the acquired data. 

5 Summary 

The main problem of the present investigation was the difficulty in con- 
structing symmetric decision tasks where finding the best result does not 
depend on individual preferences. Though only few results confirmed the 
hypotheses, fresh knowledge was achieved. A first test of Formal Concept 
Analysis as a decision-supporting instrument has been done; it has shown 
that line diagrams can clearly present information. Further investigation is 
required to find out if this holds also for the more general decision-making 
situation where the attributes are not dichotomous (the cross table becoming
an alternative-attribute-matrix with arbitrary features). A good starting
point for this are many-valued contexts, which can be transformed into
single-valued contexts by conceptual scaling.

In working on paper, the line diagrams must be drawn before the decision 
and cannot be modified later on. A way out could be the TOSCANA data 
management system (Kollewe et al. (1994)) for dynamical navigation in 
complex data structures. Further studies are needed to find out how that 
system can support decision-making. 
Abstract: This contribution demonstrates the possibility of achieving a Bayesian
(or nearly Bayesian) classification of exponentially distributed data by perceptrons
with at most two hidden layers. The number of hidden layers depends on
how much is known about the sufficient statistics figuring in the corresponding
exponential distributions. Practical applicability is illustrated by classification
of normally distributed data. Experiments with such data showed that, in
learning based on correct classification information, the error backpropagation
rule is able to create in the hidden layers surprisingly good approximations of
a priori unknown sufficient statistics. This enables the trained network to imitate
Bayesian classifiers and to achieve minimum classification errors.



1 Introduction 

Classification of random data, in particular of samples of random signals, is
an important procedure of mathematical statistics (see e.g. Hand (1981)), 
pattern recognition (e. g. Devijver and Kittler (1982)), control (e. g. Decleris 
(1991)), speech processing (e.g. Morgan and Scofield (1994)), etc. Optimal 
classifiers are known for many probabilistic sources of random variables, 
random vectors and random functions. In this paper we are interested in 
neural net realizations of optimal classifiers. The main advantage of such 
classifiers is that they usually require only a partial (or no) knowledge of the 
involved probabilistic sources. 

The possibility in principle of optimal classification by means of multilayered
perceptrons has been established in the classical works of Hornik et al. (1989),
White (1990) and Specht (1991). The networks proposed by these authors 
adapt themselves with the aim to approximate the optimal classifier at all 
points of the observation space. In some sense the network “learns” the
probabilities at all points of the observation space. If, however, as in the
case of observations of processes, the sample size (the size of the observation
window) increases, the observation space becomes too complex. The result is
that the corresponding networks adapt too slowly and, in fact, there is no hope
of “learning” the probabilities at all sample trajectories in a reasonable
time. Obviously, in the case of processes the network architecture and
dynamics should be based on specific features of the processes rather than on
their trajectories. The problem faced here is thus a kind of feature
extraction problem, which is well known in pattern recognition (see e.g.
Devijver and Kittler (1982)).

An effective optimal Bayesian discrimination of objects from two different
stochastic sources by means of a multilayered perceptron with appropriate
features at the input has been established by Ruck et al. (1990). Approximability
of the posterior probability when the sources are multivariate Gaussian
has been proved recently by Funahashi (1998). He used the components
of the quadratic sufficient statistics of Gaussian distributions in the role of
features. Our paper is close in spirit to his.

As shown in Vajda (1996) and Vajda and Vesely (1997), the so-called
asymptotic discrimination information rate is an effective tool of feature extraction.
For common signals it makes it possible to reduce the number of hidden layers and
states, as well as of synaptic weights and transfer functions of perceptron
networks, without reducing their ability of optimal discrimination. The
asymptotic discrimination rate has been evaluated in recent literature for a number
of random signals. Therefore, the applicability of the networks described in
these papers is quite universal.

This paper presents results for exponentially distributed random processes
or, more generally, for arbitrary exponentially distributed data. Sources of
exponentially distributed data are common in physics as well as statistics.
They have been systematically studied e.g. in Brown (1986). Exponentially
distributed random processes were studied recently in Küchler and
Sørensen (1989). Gaussian signals, signals of the type of diffusion
processes or processes with independent increments, and others for which
the asymptotic discrimination rate mentioned above is known, are
exponentially distributed. Therefore, the present approach is in some sense more
general than in the papers cited above. We show that if a sufficient statistic
of two exponential data models is known then the Bayes classification can
be achieved by a neuron with $m$ inputs representing the coordinates of the
statistic. If the statistic is unknown, then the Bayes classification can be
achieved by a perceptron with input $x$ representing the data and two hidden
layers inserted between the input and the previously considered neuron.
These layers form a subnetwork consisting of standard neurons, with input
$x$ and as many neurons in the second layer as the number of coordinates of
the sufficient statistic. This subnetwork learns to approximate in the second
layer all coordinates of the unknown statistic. In both cases the learning is
assumed to use the error backpropagation rule.



2 Theorem 

Let the probability distribution $P$ of an observation $X = (X_1, \ldots, X_n)$ with
values $x = (x_1, \ldots, x_n)$ in $\mathbb{R}^n$ be from a family $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ of
probability distributions on $\mathbb{R}^n$ with densities $\{p_\theta : \theta \in \Theta\}$ with respect to a
dominating measure $\mu$ (Lebesgue measure if the distributions are continuous,
counting measure if the distributions are discrete). The parameter space $\Theta$
is supposed to be a subset of $\mathbb{R}^m$.

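We consider the situation where data $X$ are produced by one of only two
possible sources from the family $\mathcal{P}$, say $P_{\theta_1}$ and $P_{\theta_2}$. In this case the decision
space $D = \{1, 2\}$ and losses

$$L(\theta_1, i) = \begin{cases} 0 & \text{if } i = 1 \\ \lambda_1 > 0 & \text{if } i = 2 \end{cases} \quad\text{and}\quad L(\theta_2, i) = \begin{cases} \lambda_2 > 0 & \text{if } i = 1 \\ 0 & \text{if } i = 2 \end{cases}$$

lead to the decision problem which can be interpreted as the problem of
classification of data according to their source of origin. Extension of our
results to situations where the number of data sources and decisions exceeds
2 will be obvious.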

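Denote by $\pi_i$ the prior probability that data are produced by the source
$P_{\theta_i}$, $i = 1, 2$. Then, as is well known, the Bayes discrimination $\delta^* : \mathbb{R}^n \to
\{1, 2\}$ is defined by the condition

$$\delta^*(x) = \operatorname{argmax}_i \phi(i) \quad\text{for}\quad \phi(i) = \lambda_i \pi_i p_{\theta_i}(x), \; i \in \{1, 2\}.$$

It follows from here that

$$\delta^*(x) = \begin{cases} 1 & \text{if } \lambda_1 \pi_1 p_{\theta_1}(x) \ge \lambda_2 \pi_2 p_{\theta_2}(x) \\ 2 & \text{if } \lambda_1 \pi_1 p_{\theta_1}(x) < \lambda_2 \pi_2 p_{\theta_2}(x). \end{cases} \tag{1}$$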



We shall suppose that the observations are exponentially distributed. This
means that the parameter space $\Theta$ is an open convex subset of $\mathbb{R}^m$, and that
there exists a mapping $T : \mathbb{R}^n \to \mathbb{R}^m$ such that

$$p_\theta(x) = \exp(\theta \cdot T(x) - \psi(\theta)) \quad \text{for all } \theta \in \Theta, \; x \in \mathbb{R}^n, \tag{2}$$

where

$$\psi(\theta) = \ln \int_{\mathbb{R}^n} \exp(\theta \cdot T(x)) \, d\mu(x).$$

We shall assume that the family (2) is not overparametrized, i.e. that
$(\theta_1 - \theta_2) \cdot T(x)$ is not $\mu$-almost everywhere constant. This means that, for different
parameters $\theta$, distributions (2) are different.

Note that (2) is a standard form of exponential distributions. The more
familiar form

$$p_\theta(x) = a(\theta) \, b(x) \exp(c(\theta) \cdot T(x))$$

can be transformed into the standard form by the substitution $c(\theta) \to \theta$
and by a modification of $\mu$ (cf. Brown (1986)).

In this paper we consider the multilayered perceptrons described in detail
e.g. in Müller et al. (1995). It is well known that such perceptrons can
approximate continuous functions defined in bounded domains. As shown by
Cybenko (1989), Funahashi (1989) and Hornik et al. (1989), for every closed
and bounded subset $S \subset \mathbb{R}^n$, every continuous function $\delta : S \to \mathbb{R}$ and
every positive $\varepsilon$, there exists a perceptron with input $x \in \mathbb{R}^n$, one hidden layer
with output $s(x)$ and with linear response

$$\rho(x) = s(x) \cdot w$$

such that

$$\sup_{x \in S} |\delta(x) - \rho(x)| < \varepsilon.$$

Remark 1. As shown by Lapedes and Farber (1988) (cf. also Sec. 6.4 of
Müller et al. (1995)), every “reasonable” function $\delta : \mathbb{R}^n \to \mathbb{R}$ can be
approximated by a perceptron with two hidden layers of neurons. Here “reasonable”
means that $\delta$ can be approximated e.g. by piecewise linear functions, or
by basis-spline functions widely used in numerical analysis. An advantage
of the method of these authors is that it is constructive, while the method
of the previously mentioned authors guarantees the existence but says little
about the construction of the desired perceptron.

Now we can formulate the following result. 



Theorem: If the data are exponentially distributed then the Bayes
discrimination (1) coincides with the response

$$\rho(x) = 1 + \mathbf{1}_{(-\infty, 0)}(s(x) \cdot w)$$

of the perceptron with $n$ inputs $x$, one hidden layer of $m + 2$ units with the
outputs

$$s(x) = (T(x), 1, 1) \in \mathbb{R}^{m+2} \quad \text{(cf. (2))}$$

and the output neuron with $m + 2$ synaptic weights

$$w = (\theta_1 - \theta_2, \; \psi(\theta_2) - \psi(\theta_1), \; \ln(\lambda_1 \pi_1) - \ln(\lambda_2 \pi_2)) \in \mathbb{R}^{m+2}.$$



Proof: By (1), $\delta^*(x) = 1$ if and only if

$$\ln \frac{p_{\theta_1}(x)}{p_{\theta_2}(x)} + \ln \frac{\lambda_1 \pi_1}{\lambda_2 \pi_2} \ge 0. \tag{3}$$

But according to (2)

$$\ln \frac{p_{\theta_1}(x)}{p_{\theta_2}(x)} = (\theta_1 - \theta_2) \cdot T(x) + \psi(\theta_2) - \psi(\theta_1).$$

Thus (3) holds if and only if $s(x)$ and $w$ considered in the Theorem satisfy
the relation $s(x) \cdot w \ge 0$, i.e. if and only if $\rho(x) = 1$. $\Box$

Remark 2: In most applications the losses and prior probabilities are
considered to be symmetric, i.e. $\lambda_1 = \lambda_2$ and $\pi_1 = \pi_2$. In such situations
the dimension of the above considered perceptrons can be reduced by one, i.e.
it suffices to consider a perceptron with outputs $s(x) = (T(x), 1) \in \mathbb{R}^{m+1}$
and synaptic weights $w = (\theta_1 - \theta_2, \; \psi(\theta_2) - \psi(\theta_1)) \in \mathbb{R}^{m+1}$.
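
The decision rule of the Theorem can be checked numerically. The following Python sketch is an illustration only and is not part of the experiments reported here: it evaluates the response $\rho(x)$ from $s(x)$ and $w$ and compares it with a direct evaluation of rule (1). The function names, the grid, the parameter values and the choice of the normal family, with $T(x) = (x, -x^2)$ and $\psi(\alpha, \beta) = \alpha^2/(4\beta) + \frac{1}{2}\ln(\pi/\beta)$, as test case are assumptions made for the example.

```python
import math


def perceptron_response(x, T, psi, theta1, theta2, losses=(1.0, 1.0), priors=(0.5, 0.5)):
    """Response rho(x) = 1 + 1_{(-inf, 0)}(s(x) . w) of the Theorem."""
    # Hidden-layer output s(x) = (T(x), 1, 1).
    s = list(T(x)) + [1.0, 1.0]
    # Output weights w = (theta1 - theta2, psi(theta2) - psi(theta1),
    #                     ln(lambda1 * pi1) - ln(lambda2 * pi2)).
    w = [a - b for a, b in zip(theta1, theta2)]
    w.append(psi(theta2) - psi(theta1))
    w.append(math.log(losses[0] * priors[0]) - math.log(losses[1] * priors[1]))
    activation = sum(si * wi for si, wi in zip(s, w))
    return 2 if activation < 0 else 1


def bayes_rule(x, T, psi, theta1, theta2, losses=(1.0, 1.0), priors=(0.5, 0.5)):
    """Direct evaluation of rule (1) with the densities (2)."""
    def density(theta):
        return math.exp(sum(t * tx for t, tx in zip(theta, T(x))) - psi(theta))
    lhs = losses[0] * priors[0] * density(theta1)
    rhs = losses[1] * priors[1] * density(theta2)
    return 1 if lhs >= rhs else 2


# Illustrative test case (an assumption of this sketch): the normal family.
def T_normal(x):
    return (x, -x * x)


def psi_normal(theta):
    alpha, beta = theta
    return alpha ** 2 / (4 * beta) + 0.5 * math.log(math.pi / beta)


theta_a = (0.0, 0.5)   # mean 0, variance 1
theta_b = (1.0, 0.25)  # mean 2, variance 2
grid = [k / 10 for k in range(-60, 61)]
agree = all(
    perceptron_response(x, T_normal, psi_normal, theta_a, theta_b, (1.0, 2.0), (0.3, 0.7))
    == bayes_rule(x, T_normal, psi_normal, theta_a, theta_b, (1.0, 2.0), (0.3, 0.7))
    for x in grid
)
print(agree)
```

The sketch uses asymmetric losses and priors so that the last weight of $w$ is exercised as well; with symmetric values that weight vanishes, which is the reduction described in Remark 2.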



3 Example 



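Let us consider the exponential family (2) with $n = 1$, $m = 2$ and with
Lebesgue measure $\mu$ on $\mathbb{R}$. Such a family is specified by two statistics $T_1(x)$
and $T_2(x)$. Then, for $\theta = (\alpha, \beta)$,

$$\psi(\alpha, \beta) = \ln \int \exp(\alpha T_1(x) + \beta T_2(x)) \, dx.$$

Our attention will be restricted to the symmetric case of $\lambda_1 = \lambda_2$ and $\pi_1 =
\pi_2$.

Consider the particular functions $T_1(x) = x$ and $T_2(x) = -x^2$. Then $\Theta =
\mathbb{R} \times (0, \infty)$ and

$$\psi(\alpha, \beta) = \frac{\alpha^2}{4\beta} + \frac{1}{2} \ln \frac{\pi}{\beta}$$

for all $(\alpha, \beta) \in \mathbb{R} \times (0, \infty)$. It is easy to verify that then (2) is the normal
family with means and variances

$$\mu = \frac{\alpha}{2\beta}, \qquad \sigma^2 = \frac{1}{2\beta}.$$

Indeed,

$$e^{\psi(\alpha, \beta)} = \sqrt{\frac{\pi}{\beta}} \, e^{\alpha^2/4\beta} = \sqrt{2\pi\sigma^2} \, e^{\mu^2/2\sigma^2}$$

and

$$p_{\alpha,\beta}(x) = \frac{e^{\alpha x - \beta x^2}}{e^{\psi(\alpha, \beta)}} = \frac{e^{-(x-\mu)^2/2\sigma^2}}{\sqrt{2\pi\sigma^2}}.$$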

In the present model, taking into account Remark 2, the perceptron of our
Theorem has parameters $n = 1$ and $m = 2$. The first two coordinates $s_1(x)$
and $s_2(x)$ of the output $s(x)$ of the hidden layer are realized by perceptrons with
two hidden layers of neurons considered in Remark 1. The whole perceptron
thus consists of three layers, two of which are hidden. This perceptron
approximates the Bayes discrimination $\delta^*(x)$ which is for

$$\theta_1 = (\alpha_1, \beta_1) \quad\text{and}\quad \theta_2 = (\alpha_2, \beta_2)$$

defined in accordance with the Theorem by

$$\delta^*(x) = 1 + \mathbf{1}_{(-\infty, 0)}(T_1(x) w_1 + T_2(x) w_2 + w_3), \tag{4}$$

where

$$w_1 = \alpha_1 - \alpha_2, \quad w_2 = \beta_1 - \beta_2, \quad w_3 = \psi(\alpha_2, \beta_2) - \psi(\alpha_1, \beta_1). \tag{5}$$

Let $X$ be the above considered normal observation corresponding to

$$\theta_1 = (0, (2\sigma_1^2)^{-1}) \quad\text{and}\quad \theta_2 = (0, (2\sigma_2^2)^{-1}) \quad\text{where } 0 < \sigma_1 < \sigma_2,$$

i.e. to $(\mu_1, \sigma_1^2) = (0, \sigma_1^2)$ and $(\mu_2, \sigma_2^2) = (0, \sigma_2^2)$. It follows from (4) and (5)
that the Bayes discrimination is

$$\delta^*(x) = 1 + \mathbf{1}_{\mathbb{R} \setminus [-\sigma_0, \sigma_0]}(x), \tag{6}$$

i.e. $\delta^*(x) = 1$ exactly when $|x| \le \sigma_0$, where $\sigma_0 > 0$ is the solution of the
equation $p_{\theta_1}(x) = p_{\theta_2}(x)$, i.e.

$$\sigma_0 = \left( \frac{2 \sigma_1^2 \sigma_2^2 \ln(\sigma_2/\sigma_1)}{\sigma_2^2 - \sigma_1^2} \right)^{1/2}.$$
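
The threshold $\sigma_0$ can be verified by evaluating (4) and (5) directly for two zero-mean normal sources. The Python sketch below is an illustration under assumed values ($\sigma_1 = 1$, $\sigma_2 = 2$) with hypothetical function names; it checks that the two densities coincide at $\sigma_0$ and that the rule (4) assigns observations with $|x| \le \sigma_0$ to the source with the smaller variance.

```python
import math


def sigma0(s1, s2):
    """Positive solution of p_theta1(x) = p_theta2(x) for zero-mean normals, 0 < s1 < s2."""
    return math.sqrt(2 * s1 ** 2 * s2 ** 2 * math.log(s2 / s1) / (s2 ** 2 - s1 ** 2))


def normal_pdf(x, s):
    """Density of the normal distribution with mean 0 and standard deviation s."""
    return math.exp(-x * x / (2 * s * s)) / math.sqrt(2 * math.pi * s * s)


def response(x, s1, s2):
    """Rule (4) with weights (5) for theta_i = (0, 1 / (2 s_i^2)), T1(x) = x, T2(x) = -x^2."""
    w1 = 0.0                                      # alpha_1 - alpha_2
    w2 = 1 / (2 * s1 ** 2) - 1 / (2 * s2 ** 2)    # beta_1 - beta_2
    w3 = math.log(s2 / s1)                        # psi(theta_2) - psi(theta_1)
    activation = x * w1 + (-x * x) * w2 + w3
    return 2 if activation < 0 else 1


# Illustrative values, chosen for this sketch only.
s1, s2 = 1.0, 2.0
threshold = sigma0(s1, s2)
print(threshold)
print(abs(normal_pdf(threshold, s1) - normal_pdf(threshold, s2)))
print([response(x, s1, s2) for x in (0.0, 1.3, 1.4, 3.0)])
```

The weight $w_3 = \ln(\sigma_2/\sigma_1)$ follows from $\psi(0, \beta) = \frac{1}{2}\ln(\pi/\beta)$ with $\beta_i = 1/(2\sigma_i^2)$.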



This discrimination function is matched by the perceptron using the hidden
approximations $s_1(x)$ and $s_2(x)$ to the statistics $T_1(x) = x$ and $T_2(x) = -x^2$.

The hidden perceptrons consisted of 3 neurons in each of the two hidden
layers, with the activation functions $\varphi(x) = 1/(1 + \exp(7x/10))$. The activation
function of the output neuron was linear. Learning has been performed by
using data $(x_1, \delta_1), \ldots, (x_N, \delta_N)$ where the $\delta_i$ are independent and take with
equal probabilities the values 1 and 2, and the $x_i$ are realizations of random
variables distributed by $p_\theta$ for $\theta = \theta_{\delta_i}$. Hence the observations $x_1, \ldots, x_N$ can
be viewed as independent realizations of a random variable with the mixed
density

$$p(x) = \frac{1}{2} \left( p_{\theta_1}(x) + p_{\theta_2}(x) \right) = \frac{1}{2\sqrt{2\pi}} \left( \frac{e^{-x^2/2\sigma_1^2}}{\sigma_1} + \frac{e^{-x^2/2\sigma_2^2}}{\sigma_2} \right). \tag{7}$$

In our experiments we used $N = 2000$. Data were simulated by using a tested
pseudorandom generator. Learning of the weights of all 7 neurons in the network
was carried out by the standard error back-propagation algorithm described
e.g. in Müller et al. (1995) with the constant learning rate $\varepsilon = 0.007$.

The experiments were performed for $\sigma_1 = 1$ and 14 different values of $\sigma_2$,
with the theoretical Bayes error $P^{\mathrm{Bayes}}$ varying between 0 and 0.5. For
each pair $(\sigma_1, \sigma_2)$ we carried out 20 experiments with randomly selected
initial weights. At the end of each experiment the learned network classified
98 000 data randomly selected according to the same rule as the learning data
and the error frequency $P^{\mathrm{perc}}$ was calculated. In all experiments this error
was reasonably close to the smallest theoretically achievable error $P^{\mathrm{Bayes}}$.
The deviation $P^{\mathrm{perc}} - P^{\mathrm{Bayes}}$ of the perceptron was below $10^{-\ldots}$. In half of
these experiments it was around $10^{-\ldots}$ and in a quarter of them close to $10^{-\ldots}$.
Detailed tables of the values of $P^{\mathrm{perc}}$ and $P^{\mathrm{perc}} - P^{\mathrm{Bayes}}$ are published in Vajda
et al. (1998).
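
For two zero-mean normal sources with equal priors, the theoretical Bayes error can be written in closed form with the error function, since $P(|X| \le a) = \operatorname{erf}(a / (\sigma\sqrt{2}))$ for $X \sim N(0, \sigma^2)$. The Python sketch below computes this value and compares it with a simulated error frequency of the threshold rule. It is a minimal illustration, not the procedure used in the experiments: the function names, the seed, the sample size of 20 000 and the value $\sigma_2 = 2$ are assumptions of the sketch, and the classifier is the exact Bayes rule, not a trained network.

```python
import math
import random


def sigma0(s1, s2):
    """Decision threshold for zero-mean normal sources with 0 < s1 < s2."""
    return math.sqrt(2 * s1 ** 2 * s2 ** 2 * math.log(s2 / s1) / (s2 ** 2 - s1 ** 2))


def bayes_error(s1, s2):
    """Error of the rule 'decide 1 iff |x| <= sigma0' under equal priors."""
    a = sigma0(s1, s2)
    miss_1 = 1.0 - math.erf(a / (s1 * math.sqrt(2)))  # source 1, |x| > a
    miss_2 = math.erf(a / (s2 * math.sqrt(2)))        # source 2, |x| <= a
    return 0.5 * (miss_1 + miss_2)


def simulated_error(s1, s2, n, seed=0):
    """Error frequency of the threshold rule on n simulated labelled observations."""
    rng = random.Random(seed)
    a = sigma0(s1, s2)
    errors = 0
    for _ in range(n):
        label = rng.choice((1, 2))
        x = rng.gauss(0.0, s1 if label == 1 else s2)
        decision = 1 if abs(x) <= a else 2
        errors += decision != label
    return errors / n


theory = bayes_error(1.0, 2.0)
estimate = simulated_error(1.0, 2.0, 20000, seed=1)
print(theory, estimate)
```

As $\sigma_2$ approaches $\sigma_1$ the value returned by `bayes_error` tends to 0.5, and for $\sigma_2$ much larger than $\sigma_1$ it tends to 0, in line with the range of Bayes errors mentioned above.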

We were also interested in whether, or to what extent, the responses $s_1(x)$
and $s_2(x)$ of the hidden subnetwork approximate the desired functions
$T_1(x) = x$ and $T_2(x) = -x^2$ considered in (4). Of course, this approximation
is irrelevant (and cannot be achieved by any empirical means) outside
the effective domain of the distribution (7), where no or few data are realized.
Also, since the output decision is based on a linear rule, the approximation
must be investigated modulo linear transforms, i.e. an arbitrary $a_i T_i(x) + b_i$ is
as good as $T_i(x)$. We verified that in this sense a surprisingly good
approximation took place in all experiments. For quantitative details we again refer
to the forthcoming paper mentioned above.



4 Conclusions 

The paper considers data $x = (x_1, \ldots, x_n)$ randomly generated by two
sources from an exponential family with a sufficient statistic $T(x) = (T_1(x),
\ldots, T_m(x))$. If $T(x)$ is known then the optimal Bayesian classification of
these data can be achieved by training a neuron with $m$ inputs $T_1(x), \ldots,
T_m(x)$, using error backpropagation. If only the dimension $m$ of the
statistic is known, then the previous conclusion applies to a more complicated
neural network of the perceptron type. The input of the network is $x$, and
the network consists of two hidden layers of neurons, with $m$ neurons
in the second layer, and of the previously considered neuron at the output.
Experiments with normal data showed that this network can be applied in
practice with excellent results.

Abstract: Knowledge understood on the basis of Peirce’s Pragmatism can be
activated by using the metaphor of landscape. This approach is outlined by dis- 
cussing conceptual landscapes of knowledge within the development of Formal 
Concept Analysis. Various tasks of knowledge processing are considered such as 
exploring, searching, recognizing, identifying, analyzing, investigating, deciding, 
improving, restructuring, and memorizing. For all these tasks examples of con- 
crete applications are given which show the fruitfulness of the landscape paradigm 
of knowledge; in most of those applications, the conceptual structures are imple- 
mented by using the management system TOSCANA. 

Contents 

1. The Landscape Paradigm for Knowledge 

2. Conceptual Data Systems and TOSCANA 

3. Tasks of Conceptual Knowledge Processing 

1 The Landscape Paradigm for Knowledge 

Methods of knowledge processing always presuppose, consciously or
unconsciously, some understanding of what knowledge is. A dominant
understanding views knowledge as a collection of facts, rules, and procedures (for
extending knowledge). But, for interpreting realities, such a view is insufficient.
Since knowledge is constituted by human argumentation within the
intersubjective community of communication, the cognitive-instrumental rationality
used for establishing knowledge must be extended to the communicative
rationality, as Jürgen Habermas argues in his “Theory of Communicative
Action” (Habermas (1981); p. 28). There are even serious doubts concerning
an objective foundation of reasoning when it is based on the
cognitive-instrumental rationality. Karl-Otto Apel, one of the leading philosophers
of Pragmatism, takes this critical view and, in claiming the change to the
pragmatic paradigm, he writes:

“In view of this problematic situation [of rational argumentation]
it is more obvious not to give up reasoning entirely, but rather to 
break with the concept of reasoning which is orientated by the 
pattern of logic-mathematical proofs. In accordance with a new 
foundation of critical rationalism, Kant’s question of transcen- 
dental reasoning has to be taken up again as the question about 
the normative conditions of the possibility of discursive com- 
munication and understanding (and therewith discursive criti- 
cism too). Reasoning then appears primarily not as deduction 
of propositions out of propositions within an objectivizable sys- 
tem of propositions in which one has always abstracted from the 
actual pragmatic dimension of argumentation, but as answering 
of why-questions of all sorts within the scope of argumentative 
discourse.” (Apel (1989); p. 19) 

In Wille (1996) a restructuring of mathematical logic is suggested which 
locates reasoning within the intersubjective community of communication 
and argumentation. Only the process of discourse and understanding leads 
to comprehensive states of rationality which properly respect the pragmatic 
dimension of reality. That does not exclude logic-mathematical proofs, but 
they can be only part of a broader argumentative discourse. With his trans- 
formation of Kant’s transcendental philosophy to intersubjectivity, Charles 
Sanders Peirce, the founder of Pragmatism, has given a convincing philo- 
sophical foundation for the constitutive role of the intersubjective community 
of communication and argumentation (cf. Apel (1976a)). It applies not only 
to epistemology and knowledge, but also to intersubjective ethics which Apel 
has elaborated in his important treatise “The Apriori of the Community of 
Communication and the Foundations of Ethics” (Apel (1976b)). 

According to Peirce’s Pragmatism, the epistemological understanding of
reality, which is basic for human knowledge, consists of an unlimited process
of investigation. In each instance of this process investigators can treat 
only questions of a necessarily restricted nature which heavily rely on com- 
mon sense, on views and purposes with respect to the considered realities. 
Therefore human knowledge is always incomplete and continuously requires 
intersubjective communication and argumentation for its formation. This 
fact has to be respected by any knowledge representation and processing. 
The pragmatic understanding of growing knowledge may be illustrated by 
using the metaphor of landscape. Let us imagine how investigators explore 
and analyze a landscape, first of all in their immediate surroundings, some- 
times by longer expeditions, and in special cases even from the air. They use 
different kinds of locomotion, first more convenient ones, often those which 
require strength and tenacity. They make plans and designs, discuss aims 
and purposes, and work out evaluations and decisions. Finally, they write 
reports and draw maps which they interchange with other investigators to 
obtain common views and understandings. By these activities the landscape 
may be shaped and developed so that it can support a great variety of tasks 
and purposes. A common ground of the metaphor of landscape and the
pragmatic understanding of knowledge is the idea of continuity which, by
Peirce’s epistemological semiotics, is essential for all human thinking (cf. 
Peirce (1931), Buchler (1955)). 

The idea of landscape is becoming increasingly influential in the field of
knowledge representation and processing. Especially, the frequently used 
term of “navigation” suggests how this idea is becoming a leading metaphor. 
That view also is supported by the actual development of viewing computers 
as medium. This development shows that it is time for explicating the prag- 
matic landscape paradigm for knowledge processing. In this paper, we will 
concentrate only on conceptual landscapes of knowledge as they are viewed 
within the development of Formal Concept Analysis. As a basic notion we 
explain “conceptual data systems” together with the management system 
“TOSCANA” which allows us to implement conceptual data systems. In 
some detail, we discuss various tasks of conceptual knowledge processing 
which can be performed within conceptual landscapes of knowledge repre- 
sented by TOSCANA as conceptual data systems. 



2 Conceptual Data Systems and TOSCANA 

A conceptual landscape of knowledge designed for computational treatment
needs a comprehensive formalization of concept which can widely support human
communication and argumentation concerning the represented knowledge.
Formal Concept Analysis (Wille (1982), Ganter and Wille (1996)) is
based on such a formalization, which reflects the understanding of a concept
as a unit of thought constituted by its extension and its intension. This
formalization is possible by always starting with a (formal) context defined as a
triple $(G, M, I)$ where $G$ and $M$ are sets and $I$ is a binary relation between
$G$ and $M$ (i.e. $I \subseteq G \times M$); the elements of $G$ and $M$ are called objects and
attributes, respectively, and $gIm$ ($\Leftrightarrow (g, m) \in I$) is read: “the object $g$ has
the attribute $m$”. A (formal) concept of a context $(G, M, I)$ is defined as a
pair $(A, B)$ with $A \subseteq G$ and $B \subseteq M$ such that $(A, B)$ is maximal with the
property $A \times B \subseteq I$. The sets $A$ and $B$ are called the extent and the intent
of the concept $(A, B)$, respectively. The subconcept-superconcept relation is
formalized by $(A_1, B_1) \le (A_2, B_2) :\Leftrightarrow A_1 \subseteq A_2$ ($\Leftrightarrow B_1 \supseteq B_2$). The set
of all concepts of a context $(G, M, I)$ together with the order relation $\le$ is
always a complete lattice, called the concept lattice of $(G, M, I)$ and denoted
by $\underline{\mathfrak{B}}(G, M, I)$.
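
The definition of a formal concept can be made concrete with a small program. The Python sketch below enumerates all concepts of a context by closing every subset of objects; this brute-force approach is an assumption chosen for brevity and is only suitable for very small contexts. The function names and the toy context are hypothetical and are not taken from Figure 1.

```python
from itertools import combinations


def common_attributes(objs, attributes, I):
    """Attributes shared by every object in objs (all attributes if objs is empty)."""
    return frozenset(m for m in attributes if all((g, m) in I for g in objs))


def common_objects(attrs, objects, I):
    """Objects having every attribute in attrs (all objects if attrs is empty)."""
    return frozenset(g for g in objects if all((g, m) in I for m in attrs))


def formal_concepts(objects, attributes, I):
    """All pairs (A, B) that are maximal with A x B contained in I.

    Brute force: every subset of objects is closed, so the running time is
    exponential in the number of objects.
    """
    found = set()
    for k in range(len(objects) + 1):
        for objs in combinations(objects, k):
            intent = common_attributes(objs, attributes, I)
            extent = common_objects(intent, objects, I)
            found.add((extent, intent))
    return found


def is_subconcept(c1, c2):
    """(A1, B1) <= (A2, B2) holds iff A1 is a subset of A2."""
    return c1[0] <= c2[0]


# Hypothetical toy context, not the context of Figure 1.
G = ["a", "b", "c"]
M = ["p", "q"]
I = {("a", "p"), ("b", "p"), ("b", "q"), ("c", "q")}

lattice = formal_concepts(G, M, I)
for extent, intent in sorted(lattice, key=lambda c: (len(c[0]), sorted(c[0]))):
    print(sorted(extent), sorted(intent))
```

For this toy context the sketch finds four concepts, and for every pair of them the extent inclusion $A_1 \subseteq A_2$ holds exactly when the intent inclusion $B_1 \supseteq B_2$ holds.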

For interpretation and communication it is important that concept lattices
can be represented by (labelled) line diagrams in which the underlying
context still can be recognized (cf. Wille (1989a)). In Figure 1, a context is
described by a table in which the crosses represent the binary relation $I$
between the object set $G$ (consisting of the former presidents of the FRG) and the
attribute set $M$. A line diagram of the concept lattice $\underline{\mathfrak{B}}(G, M, I)$ is shown
in Figure 2.

Figure 1: A formal context concerning the former presidents of the FRG

In a line diagram of a concept lattice, the name of an object $g$ is always
attached to the little circle representing the smallest concept with $g$ in its
extent (denoted by $\gamma g$); dually, the name of an attribute $m$ is always attached
to the little circle representing the largest concept with $m$ in its intent
(denoted by $\mu m$). This labelling allows us to read the context relation from the
diagram because $gIm \Leftrightarrow \gamma g \le \mu m$. The extent and intent of each
concept $(A, B)$ also can be recognized because $A = \{g \in G \mid \gamma g \le (A, B)\}$
and $B = \{m \in M \mid (A, B) \le \mu m\}$. For example, the little circle in
the line diagram of Figure 2 labelled with “terms of office: 2” represents
the concept with the extent {Heuss, Lübke, von Weizsäcker} and the intent
{age of entrance: > 60, terms of office: 2}.
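
The labelling rule can be checked mechanically: the object concept $\gamma g$ and the attribute concept $\mu m$ are computed from the context, and the context relation is recovered from the order between them. The following Python sketch illustrates this on a hypothetical toy context (not the context of Figure 1); all function names are assumptions of the example.

```python
def object_concept(g, objects, attributes, I):
    """gamma(g): the smallest concept with g in its extent."""
    intent = frozenset(m for m in attributes if (g, m) in I)
    extent = frozenset(h for h in objects if all((h, m) in I for m in intent))
    return extent, intent


def attribute_concept(m, objects, attributes, I):
    """mu(m): the largest concept with m in its intent."""
    extent = frozenset(g for g in objects if (g, m) in I)
    intent = frozenset(n for n in attributes if all((g, n) in I for g in extent))
    return extent, intent


def leq(c1, c2):
    """Subconcept order: compare the extents."""
    return c1[0] <= c2[0]


# Hypothetical toy context.
G = ["a", "b", "c"]
M = ["p", "q"]
I = {("a", "p"), ("b", "p"), ("b", "q"), ("c", "q")}

# The context relation is recovered from the labels: g I m iff gamma(g) <= mu(m).
recovered = {
    (g, m)
    for g in G
    for m in M
    if leq(object_concept(g, G, M, I), attribute_concept(m, G, M, I))
}
print(recovered == I)
```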

Data are often given by object-attribute-value triples. In those cases the
formalization starts with a many-valued context defined as a quadruple
$(G, M, W, I)$ where $G$, $M$, and $W$ are sets and $I$ is a ternary relation between
$G$, $M$, and $W$ (i.e. $I \subseteq G \times M \times W$) for which $(g, m, w_1), (g, m, w_2) \in I$
implies $w_1 = w_2$; the elements of $G$, $M$, and $W$ are called objects,
attributes, and attribute values, respectively, and $(g, m, w) \in I$ (for which we
often write $m(g) = w$) is read: “the object $g$ has the value $w$ for the
attribute $m$”. To obtain a concept lattice for $(G, M, W, I)$, the many-valued
context is transformed into a formal context (also called a “one-valued”
context) using the method of conceptual scaling (cf. [GW89]): For each
attribute $m$ a formal context $\mathbb{S}_m := (G_m, M_m, I_m)$ with $m(G) \subseteq G_m$ is chosen
which yields an appropriate conceptual structure for the attribute values
of $m$; the derived context is then the triple $(G, \bigcup_{m \in M} \{m\} \times M_m, J)$ with
$gJ(m, n) :\Leftrightarrow m(g) I_m n$, and its concept lattice is the desired conceptual
structure of the many-valued context $(G, M, W, I)$. The choice of the contexts
$\mathbb{S}_m$, which are called conceptual scales, must be considered as already
a first interpretation of the data coded by $(G, M, W, I)$. This interpretation
should be guided by the purpose for which the data are being studied.
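
The derivation of a one-valued context by conceptual scaling can be sketched in a few lines. The Python example below is an illustration under assumptions: the data, the attribute name "age", the scale and the function name `derive_context` are all hypothetical, and the scale shown is only one of many possible interpretations of the values.

```python
def derive_context(objects, value_of, scales):
    """Plain scaling of a many-valued context.

    value_of[(g, m)] is the value m(g); scales[m] = (M_m, I_m), where M_m lists
    the scale attributes and I_m is a set of pairs (value, scale attribute).
    Returns the derived attributes (m, n) and the relation J, where
    g J (m, n) holds iff m(g) I_m n.
    """
    derived_attributes = [(m, n) for m, (M_m, _) in scales.items() for n in M_m]
    J = {
        (g, (m, n))
        for g in objects
        for (m, n) in derived_attributes
        if (value_of[(g, m)], n) in scales[m][1]
    }
    return derived_attributes, J


# Hypothetical data: two objects with one many-valued attribute "age".
objects = ["x", "y"]
value_of = {("x", "age"): 55, ("y", "age"): 65}

# A hypothetical scale for "age" with two scale attributes.
age_values = [55, 65]
age_attributes = ["<= 60", "> 60"]
age_relation = {(v, "<= 60") for v in age_values if v <= 60} | {
    (v, "> 60") for v in age_values if v > 60
}
scales = {"age": (age_attributes, age_relation)}

derived_attributes, J = derive_context(objects, value_of, scales)
print(sorted(J))
```

Choosing a different scale for the same values, for instance one with nested thresholds, would yield a different derived context, which reflects the remark that the choice of scales is already an interpretation of the data.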

Figure 2: The concept lattice of the context of Figure 1

Experience has shown that tasks of conceptual knowledge processing mostly
are concerned with questions which relate only to a small number of
many-valued attributes. The situation has an analogy to landscape exploration
which also cannot view everything at the same time and therefore proceeds
in a predominantly local way. For conceptual knowledge communication via 
line diagrams, the local nature of knowledge processing justifies the strategy 
of presenting only combinations of a few line diagrams representing concept 
lattices of conceptual scales. The global aspect then lies in the potentiality 
for navigating toward every conceptual scale. For the realization of those 
ideas, conceptual data systems are designed which consist of a many-valued 
context stored in a (relational) database and a collection of conceptual scales 
with line diagrams of their concept lattices (see Vogt (1991), Scheich (1993), 
Wille (1992b)). For the combination of line diagrams, nested line diagrams 
(see Wille (1984), Wille (1989)) are very successful for graphical represen- 
tation. Numerous implementations of conceptual data systems have been 
performed by using TOSCANA, a management system for conceptual data 
systems, developed at the TH Darmstadt (see Kollewe (1994), Vogt (1994)). 
The name “TOSCANA” (= “Tools of Concept Analysis”) was chosen to 
indicate that this management system allows us to implement conceptual 
landscapes of knowledge. In choosing just this name, the main reason was 
that Tuscany (Italian: Toscana) is viewed as the prototype of a cultural 
landscape which stimulated many important innovations and discoveries, 
and is rich in its diversity and attractive for wandering in. The newest 
version of the TOSCANA management system is programmed in C++ by
using “The Formal Concept Analysis Library” (Vogt (1996)). It can be
connected to (relational) databases via their query languages. For realizing
a conceptual data system, its conceptual scales and line diagrams can be
edited and implemented by using the program Anaconda for creating files
in the script language ConScript, which was especially developed for this
purpose (see Vogt (1996)). With TOSCANA, a user can always choose a
sequence of implemented conceptual scales which he wants to activate for his
exploration. After selecting the number $n \in \{1, 2, 3, 4\}$ of scales whose line
diagrams are nested at once, TOSCANA presents a nested line diagram of 
the concept lattices of the first n line diagrams. If the user has decided where 
to refine the presented diagram, he can zoom into that part by activating 
the next scale in the sequence. In this way he can zoom further and further, 
but also interchange scales and even insert new scales. The basic idea of 
this navigation method lies in the conviction that each serious navigation is 
connected with a learning process of the user, who increasingly understands 
better how to specify what he is looking for. Of course, TOSCANA enables 
the user always to make prints of the interesting diagrams which he has 
found while navigating through the “conceptual landscape” . 



3 Tasks of Conceptual Knowledge Processing

Conceptual knowledge processing has as its overall aim supporting human 
communication and argumentation to establish intersubjectively assured 
knowledge (cf. Wille (1994)). Therefore formalizations for the computational 
treatment of knowledge always have to keep a sufficiency of connections to 
the contents which have been formalized. In particular, the formalization 
of concepts must allow for the reconstruction of their extensional and in- 
tensional meaning, even after they have been treated by formal procedures. 
For many methods of knowledge processing that is critical, for otherwise 
they may rely too much on modern mathematical logic in which the unity of 
extension and intension is broken (cf. Wille (1995)). Modern logic should be 
extended by including the old logic paradigm represented by the graduated 
scheme concept-judgment-conclusion (cf. Wille (1996)). That would open
modern logic to discourses of communicative rationality which are necessary 
to reach significant knowledge. 

The formalization of concepts in Formal Concept Analysis keeps the unity 
of extension and intension. On an elementary level, the connections to the 
contents are given by the names of objects and attributes (and attribute 
values). Furthermore, the relationships between objects and attributes also 
may be named to document their specific nature (this idea even led to a 
triadic approach to Formal Concept Analysis (Lehmann and Wille (1995))). 
On another level, connections are represented by the names of objects and 
attributes of conceptual scales and by names of the scales which allow for 
connecting with the contents of more theoretical views. For conceptual 
knowledge systems this level is important because it opens up the possi- 
bility of inserting expert knowledge in the system on a general level. An 
even more general level is given by the structural patterns which display 
models of interpretation methods, such as dependency, dimensionality, etc.
For connecting all these aspects in the contents and the formal parts, the
idea of conceptual landscapes of knowledge yields a paradigm for useful repre- 
sentations. That will be outlined by discussing several applications in which 
tasks of conceptual knowledge processing were performed by using graphical 
representations of such “conceptual landscapes of knowledge” . 

3.1 Exploring 

Exploring shall mean looking for something of which one has only a vague 
idea as it often happens, for instance, in retrieving literature. In an inter- 
disciplinary research project (1991/92), a prototype of a conceptual data 
system for retrieval in the library of the “Center of Interdisciplinary Tech- 
nology Research” at the TH Darmstadt was developed and implemented by 
using TOSCANA (see Kollewe et al. (1995)). Books were chosen as ob- 
jects and keywords as attributes. For studying processes of focussing ideas, 
conceptual scales were designed to represent aspects of research questions. 
Experience with the implemented prototype has been so productive that a 
TOSCANA retrieval system now has been developed for the whole library 
of the center (see Rock (1996)). 

3.2 Searching 

Searching shall be understood as looking for something which one can more 
or less specify but not localize. During three years (1993-95), a TOSCANA 
searching system was developed at the request of the ministry of civil engi- 
neering of the German province Nordrhein-Westfalen, to enable the archi- 
tects of this province to find for a specific task in building construction all 
relevant paragraphs in laws, regulations, and standards (see Kollewe et al. 
(1994)). Paragraphs of documents were chosen as objects, components of 
buildings and building requirements as attributes. Conceptual scales repre- 
sent the paragraphs according to units of related components and require- 
ments. In the process of establishing the system, the line diagrams of the 
concept lattices already have proven useful because the coworking lawyers 
and engineers frequently discovered data mistakes and failures within the 
diagrams which substantially helped to improve the system. 

3.3 Recognizing 

Recognizing is understood with the meaning of perceiving clearly circum- 
stances and relationships. The research group on developmental psychology 
at the TH Darmstadt has extensively used concept lattices for evaluating 
interviews concerning concept development among children. Recently, they 
have started to use TOSCANA to better recognize the contents and con- 
nections indicated by the analyzed interviews. Their main interest is to 
deduce developmental sequences which explain how concepts may change 
during child development. In summarizing individual data, a context of 62 
children, evaluated by nine criteria of concept development, was established.
As result of a detailed argumentation integrating extensional and intensional 
views, the children were ordered in (branching) levels of development which 
became clear by inspecting the graphically represented concept lattice of the 
derived context (see Strahringer and Wille (1993), Seiler et al. (1996)). 

3.4 Identifying 

Identifying shall mean determining the taxonomic position of an object
within a given classification. For identifying the symmetry type of
wallpaper patterns, a computer program was developed for presentation at the
Symmetry Exhibition shown at the Mathildenhöhe in Darmstadt in 1986. To
establish the classification as a concept lattice, a formal context was
created with the 17 symmetry types as objects and, as attributes, the
symmetry properties that crystallographers usually consider for identifying
symmetry types of wallpaper patterns (see Kipke and Wille (1986), Wille
(1987)). A line diagram of the concept lattice of this context then serves
as a map of a conceptual landscape indicating all possible paths (starting
from the top) to reach the taxonomic positions labelled by the object names.
The identification process (always represented in the diagram by its current
state) proceeds by inputting recognized attributes which determine the path
toward the desired taxonomic position.
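The narrowing effect of each recognized attribute can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class name `Identifier`, the objects A, B, C, and their attributes are hypothetical toy data, not the 17 symmetry types or the program shown at the exhibition.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical toy formal context: object names mapped to their attribute sets.
class Identifier {
    private final Map<String, Set<String>> context;

    Identifier(Map<String, Set<String>> context) {
        this.context = context;
    }

    // All objects that have every attribute recognized so far.
    SortedSet<String> candidates(Set<String> recognized) {
        SortedSet<String> result = new TreeSet<>();
        for (Map.Entry<String, Set<String>> entry : context.entrySet()) {
            if (entry.getValue().containsAll(recognized)) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    // Invented example data, used only for this illustration.
    static Identifier demo() {
        Map<String, Set<String>> ctx = new HashMap<>();
        ctx.put("A", new HashSet<>(Arrays.asList("rotation", "reflection")));
        ctx.put("B", new HashSet<>(Arrays.asList("rotation")));
        ctx.put("C", new HashSet<>(Arrays.asList("reflection", "glide")));
        return new Identifier(ctx);
    }

    public static void main(String[] args) {
        Identifier id = demo();
        Set<String> recognized = new LinkedHashSet<>();
        for (String attribute : new String[] {"rotation", "reflection"}) {
            recognized.add(attribute);
            System.out.println(recognized + " -> " + id.candidates(recognized));
        }
    }
}
```

Each input attribute can only shrink the candidate set, which corresponds to moving downward in the line diagram.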

3.5 Analyzing 

Analyzing in the scope of conceptual knowledge processing is understood
as examining data in their relationships while guided by theoretical views
and declared purposes. Data analysis in this sense has influenced the
development of Formal Concept Analysis to a great extent. The conception of
conceptual data systems and basic ideas for TOSCANA arose out of problems
in performing a comprehensive analysis of a given data context. This
many-valued context was elaborated by political scientists for summarizing
their case studies about international regimes. For a general understanding
of international regimes and for examining theoretical views, conceptual
relationships and dependencies between attributes were studied, for which
more than 100 line diagrams of concept lattices of smaller subcontexts were
produced (see Kohler-Koch (1989), Vogt et al. (1991)). With the now existing
TOSCANA implementation of the regime context, this time-consuming
work can be avoided and the analysis can be performed more flexibly and
comprehensively.

3.6 Investigating 

Investigating means to study by close examination and systematic inquiry.
Medicine is an important area for investigations utilizing methods of
conceptual knowledge processing. The conception of conceptual data systems was
presented first by way of a medical study reporting data about children with
diabetes (see Scheich et al. (1993)). Another study in medicine, concerned
with cancer in children, has recently been conducted at Frankfurt Univer- 
sity using TOSCANA. The leading idea is to establish medical landscapes 
of knowledge in which enough single cases are conceptually located together 
with comprehensive (theoretical) knowledge about diseases and their treat- 
ments. The enormous diversity of individual data makes clear that a coding 
of knowledge by facts, rules, and procedures is insufficient. Knowledge rep- 
resentation has to be more open, in order to bring in experiences, intuition, 
individual abilities, human argumentation, etc. Thus, a knowledge system 
for medical investigations must support knowledge communication multifar- 
iously. 



3.7 Deciding 

Deciding shall mean resolving a situation of uncertainty by an order. To
obtain such an order determining where swimming and drawing drinking
water is allowed at the Canadian border of Lake Ontario, the degree of water
pollution was tested at 26 locations from the mouth of the Niagara River
up to the Cataraqui River. Since it did not seem adequate to create a formula
yielding a summarizing value for the values of the five tests, the many-valued
context formed by the locations, tests, and test values was represented by a
line diagram of a concept lattice which provided a clear picture of the data
for the decision (see Strahringer and Wille (1992)). Such a conceptual
presentation keeping all the data ‘alive’ gains an advantage because it supports
communication and argumentation necessary for the decision, and respects
the fact that data and their formal treatments never can cover all relevant
aspects; in particular, the labelled line diagram may serve as a basis for
negotiating the decision.

3.8 Improving 

Improving has the meaning of enhancement in quality and value. In con- 
ceptual knowledge processing an important task is to examine how given 
data and knowledge may be improved. As an example we report on an ap- 
plication of Formal Concept Analysis in optimizing the production of chips 
(see Wille (1993)). For the investigation four independent variables, namely 
“temperature”, “voltage”, “catalyst”, and “operator”, were chosen and, in 
addition, two dependent variables for indicating the quality of the produced 
chips. By an appropriate conceptual scaling of the resulting many-valued 
context of the measured data, a concept lattice was derived whose nested 
line diagram not only showed the optimal combination among the chosen 
values for the independent variables, but also how an even better combina- 
tion of new values (which should be tested) could improve this optimum. 
That is the advantage of a graphically presented conceptual landscape of 
knowledge: it even suggests where to extend the knowledge to obtain more 
insight for reaching a specified aim. 

3.9 Restructuring

Restructuring means to reshape a given structure, which, within the scope 
of our discussion, is conceptual in its nature. Restructuring with methods 
of Formal Concept Analysis has been successfully performed in software
engineering (see Lindig and Snelting (1996)). As an example, the X-Window
tool “xload” is conceptually analyzed in Snelting (1995). From this 724-line 
program a formal context is deduced with pieces of the source code as ob- 
jects and governing preprocessor symbols as attributes. Its concept lattice 
consisting of 141 concepts is judged to be “pretty chaotic” which indicates 
that “the program obviously suffers from configuration hacking”. For un- 
derstanding and restructuring such software, concept lattices can give useful 
insights and can help to find simplifications for which lattice decompositions 
and attribute implications have been considered. 



3.10 Memorizing 

Memorizing is understood as a process of committing and reproducing what 
has been learned and retained. Within the scope of conceptual knowl- 
edge processing this is achieved by conceptual knowledge systems (see Wille 
(1992a)). Such a system has been prepared for the theory of finite lattices 
by a comprehensive investigation of all implications between 50 established 
lattice properties using the method of “attribute exploration” (see Reeg and 
Weiß (1990)). Numerous line diagrams of parts of the stored information
have been elaborated, each of which presents a multitude of interesting ex- 
amples and propositions. This compressed knowledge representation has 
stimulated a project for building up further conceptual knowledge systems 
of mathematical theories. Such knowledge systems are designed for storing 
and graphically presenting mathematical knowledge, which is created by ex- 
perts, and for inferring further results, which can be obtained by routine 
mathematical reasoning. 



The described meaning of the different tasks was established with the aid of
relevant dictionaries such as Duden (1993) and Webster (1987). For each 
task the appropriateness of the meaning for conceptual knowledge processing 
was examined by theoretical considerations and various examples of concrete 
applications. Of course, in applications, the tasks are often performed in a 
combined way, so that their specific aims merge in some kind of mixed 
aim. Therefore, the described tasks should be understood as ideal types 
which may vary and combine multifariously for fulfilling the great variety of 
possible purposes. Nevertheless, clear descriptions of tasks are necessary as 
guidelines for a successful development in conceptual knowledge processing.
Hence, further ideal types of tasks should be elaborated. 

Abstract: In this paper we examine the use of Java’s networking capabilities in a
client/server application for data analysis. On the client side runs an applet which 
collects analysis requests. The applet sends the requests to the server where the 
analysis actually takes place. The server sends the analysis results back to the 
client which displays them graphically. We demonstrate the use of Java’s RMI by 
means of a simple classification example. 



1 Motivation 

In a typical scenario, large amounts of data and the algorithms for analyzing
them reside in one place, and remote requests for data analysis should be
possible over the internet. This scenario can be implemented as a
client/server application. The data then stays in a protected environment
on the server, where it is condensed, and only the analysis results are sent
back over the net to the clients. Compared with sending the original data,
this requires fewer network resources. On the other hand, a powerful server
is required for the data analysis, while the presentation of results on the
client side demands only limited computational power. Figure 1 shows the
general structure of such a connection.




Figure 1: Connection of client and server over the internet

2 An Example

Since we want to concentrate on client/server programming techniques using 
Java we chose a very simple example for the server side. 

The aim is to cluster a set of n objects that are characterized by m variables
via the k-means algorithm (Hartigan, Wong (1979)). In order to be able
to represent clustering results easily, we restricted the data type of each
variable to $\mathbb{R}$ and set m = 2. (Of course, we could also have chosen an
algorithm that processes qualitative or mixed data and produces results in
$\mathbb{R}^2$.) A user of the system has to select one of the data sets available on the
server and must provide the maximal number k of clusters to be computed.
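For readers unfamiliar with k-means, a minimal sketch for points in the plane is given below. It is an assumption-laden illustration, not the server code of this paper: it uses the plain alternating assignment/update iteration rather than the Hartigan-Wong variant cited above, takes the first k points as initial centers, and the class name `KMeans2D` is hypothetical.

```java
import java.util.Arrays;

// Minimal k-means sketch for points in the plane (m = 2).
class KMeans2D {

    // Returns, for each point, the index of the cluster it is assigned to.
    static int[] cluster(double[][] points, int k, int maxIterations) {
        int n = points.length;
        if (k < 1 || k > n) {
            throw new IllegalArgumentException("need 1 <= k <= n");
        }
        // Assumption: the first k points serve as initial centers.
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            centers[c] = points[c].clone();
        }
        int[] assignment = new int[n];
        Arrays.fill(assignment, -1);

        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;
            // Assignment step: attach every point to its nearest center.
            for (int i = 0; i < n; i++) {
                int best = 0;
                double bestDist = Double.POSITIVE_INFINITY;
                for (int c = 0; c < k; c++) {
                    double dx = points[i][0] - centers[c][0];
                    double dy = points[i][1] - centers[c][1];
                    double dist = dx * dx + dy * dy;
                    if (dist < bestDist) {
                        bestDist = dist;
                        best = c;
                    }
                }
                if (assignment[i] != best) {
                    assignment[i] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break; // no point changed its cluster: converged
            }
            // Update step: move each center to the mean of its points.
            double[][] sums = new double[k][2];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {
                sums[assignment[i]][0] += points[i][0];
                sums[assignment[i]][1] += points[i][1];
                counts[assignment[i]]++;
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) {
                    centers[c][0] = sums[c][0] / counts[c];
                    centers[c][1] = sums[c][1] / counts[c];
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        double[][] data = {{0, 0}, {10, 10}, {0, 1}, {10, 11}, {1, 0}, {11, 10}};
        System.out.println(Arrays.toString(cluster(data, 2, 100)));
    }
}
```

With two well-separated groups as in `main`, the points near the origin end up in one cluster and the points near (10, 10) in the other.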



3 High Level Overview 

The first step on the client side is to start a browser and load the HTML 
page with the applet from the server. You need to know the URL of that 
page, for example: 

http://www.wifo.uni-mannheim.de/dresden98/client.html 

Next, you select the desired data set as well as the number of clusters. The 
request is then sent to the server and the client waits for the answer. The 
server program computes the result and sends it back to the client. Finally, 
the client presents the corresponding graphic and waits for new input. 

After the server is started, it waits for client requests. If such a request
occurs, the server loads data from the local database, analyzes it, and sends
the results back to the client.



4 Implementation Techniques 

By now, numerous implementation techniques exist for interactive HTML- 
pages. A rather new and promising technique is the use of Java on the 
client side as well as on the server side. In this paper we demonstrate how
the interaction between client and server can be achieved with RMI (Remote
Method Invocation). Besides RMI you can use low level sockets, RPC
(Remote Procedure Call), and CORBA (Common Object Request Broker
Architecture). In our experience, RMI is the clear winner for Java-to-Java
interprocess communication. Some references concerning RMI that
we can recommend are Berg, Fritzinger (1998), Orfali, Harkey (1998), Sun 
Microsystems Inc. (1998), and Vanderburg (1997). 

For the client side we implemented an applet, which lets the user specify
the input data (filename and number of clusters) for the analysis algorithm.
Since an applet can only establish connections to the server it came from,
the user cannot change the server name. Likewise, the port is not editable
because we solely use the default port. The HTML-page which embeds the
applet looks like this:

    <html>
    <head>
    <title>Cluster Analysis</title>
    </head>
    <body>
    <center>
    <h1>Cluster Analysis</h1>
    <applet code="Client" width="650" height="500">
    This browser doesn't know the APPLET-tag.
    </applet>
    </center>
    </body>
    </html>

After a successful cluster analysis the browser (here Netscape’s Communi- 
cator) shows a graphic like the one in figure 2. 



5 Remote Method Invocation (RMI) 

Objects communicate through messages, i.e. one object calls a method of 
another object. In our example, the client object has to call a server method 
which actually performs the cluster analysis. The problem here is that client 
and server objects reside on different computers and therefore in different 
Java virtual machines. They have to communicate over a network. 

Java offers the RMI package for interprocess communication between Java 
virtual machines. After some preparation RMI enables a method of an 
object in one virtual machine to call a method of an object in another virtual 
machine with the same syntax and ease as a local method invocation. 

On the basis of a simple example for a cluster analysis we want to explain 
and demonstrate how the capabilities of RMI are used in practice. In doing 
so we will concentrate on the code snippets which are relevant for RMI. 



6 Implementation of the Server Side 

Methods that should be callable remotely have to be declared in an interface
which extends java.rmi.Remote. Only those methods specified in a remote
interface are available remotely. Such methods must declare
java.rmi.RemoteException in their throws clause, because a remote method
invocation across a network can fail in numerous ways. This exception provides
a mechanism to gracefully handle unlikely but possible failure scenarios. One
possible interface declaration for our remote object would be the following:
Figure 2: Screenshot of the client side



    // RMIServerInterface.java
    import java.rmi.*;

    interface RMIServerInterface extends Remote {
        public ClusterAnalysis kmeans(String filename,
            int clusterCount) throws RemoteException;
    }

Next, we define the kmeans method in the RMIServer class which implements
our RMIServerInterface. The RMIServer class extends
java.rmi.server.UnicastRemoteObject. This class provides support for
point-to-point active object references (invocations, parameters, and results)
using TCP streams. By extending UnicastRemoteObject, RMIServer will be
automatically “exported” and is ready to be used outside the virtual machine
in which it was created.
    // RMIServer.java
    import java.io.*;
    import java.rmi.*;
    import java.rmi.server.*;

    class RMIServer
        extends UnicastRemoteObject
        implements RMIServerInterface {

        public ClusterAnalysis kmeans(String filename,
            int clusterCount) throws RemoteException {
            try {
                // Read the serialized data set from the server's file system.
                FileInputStream fis
                    = new FileInputStream(filename);
                ObjectInputStream ois
                    = new ObjectInputStream(fis);
                Data d = (Data) ois.readObject();
                // Run the cluster analysis on the server.
                ClusterAnalysis ca
                    = new ClusterAnalysis(clusterCount, d);
                ca.classify();
                return ca;
            }
            catch (Exception e) {
                // Any failure is reported to the client as a null result.
                return null;
            }
        }

    }

The kmeans method reads the data file, classifies the data, and returns the
result. Arguments and return values of remote methods that are neither
primitive values nor remote objects must implement the Serializable
interface. So does the ClusterAnalysis class.

class ClusterAnalysis implements Serializable { 

} 
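What serialization means for a returned object can be illustrated without any network. The sketch below writes an object to a byte array and reads it back, which is in essence what happens to a serializable result on its way to the client: the receiver gets an independent copy. The class `CopyDemo` and its nested `Result` class are hypothetical stand-ins, not the ClusterAnalysis class of this paper.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class CopyDemo {
    // Hypothetical stand-in for a result class such as ClusterAnalysis.
    static class Result implements Serializable {
        private static final long serialVersionUID = 1L;
        int[] labels;

        Result(int[] labels) {
            this.labels = labels;
        }
    }

    // Serializes a value to bytes and deserializes it again.
    static <T extends Serializable> T roundTrip(T value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                @SuppressWarnings("unchecked")
                T copy = (T) in.readObject();
                return copy;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Result original = new Result(new int[] {0, 1, 1});
        Result copy = roundTrip(original);
        copy.labels[0] = 9; // changing the copy leaves the original untouched
        System.out.println(original.labels[0] + " " + copy.labels[0]);
    }
}
```

A class that does not implement Serializable would make `writeObject` fail with a NotSerializableException, which is why the result class has to declare the interface.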

The main method of the server class sets an RMI security manager. If no 
security manager has been set, RMI will only load classes from local system 
files as defined by the environment variable CLASSPATH or a path provided 
with the -classpath option. 

    public static void main(String[] args) {
        try {
            System.setSecurityManager(new
                RMISecurityManager());
            Naming.rebind("RMIServer", new RMIServer());
        }
        catch (Exception e) {
            e.printStackTrace();
        }
    }

The next step is to create an RMIServer object and bind it to a unique name. 
A remote object can be bound to any name, but you have to be aware of 
name collisions. 

Even if the body of the constructor is empty you have to define a default 
constructor because the constructor of the superclass UnicastRemoteObject 
can throw a RemoteException. 

    public RMIServer() throws RemoteException { }

The server implementation itself is finished now, but there is some more 
work to do before clients can connect. 



7 Implementation of the Client Side 

For the client side of our client/server application we implement an applet. 
In the following we will concentrate on the code for the communication with 
the server. 

// Client.java

public class Client extends Applet 
implements ActionListener { 



} 

When a browser or applet viewer loads an applet, the init method is called to
inform the applet that it has been loaded into the system and can perform
its initialization. Here, we first set an RMI security manager so that we can
connect to the remote server. Additionally, init has to provide code for the
user interface shown in figure 2.

    public void init() {
        if (System.getSecurityManager() == null)
            System.setSecurityManager(new RMISecurityManager());

    }

After the user has pressed the Analysis button, the actionPerformed method
is called. Within this method body the client applet connects to the server.
    public void actionPerformed(ActionEvent e) {
        try {
            RMIServerInterface si = (RMIServerInterface)
                Naming.lookup("//" + server + "/RMIServer");
            ClusterAnalysis ca
                = si.kmeans(filename, clusterCount);

        }
        catch (Exception x) {

        }

    }

The lookup method returns the remote object for the given URL. An RMI 
URL looks very much like any other URL. It has the general form 

rmi://host:port/name 

where host is the host name of the registry; it defaults to the current host.
Applets can retrieve a reference to a remote object only from the server from
which the applet came. port specifies the port number of the registry and
defaults to the registry port number 1099. name identifies the remote object
on the server.
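The defaulting rules for host and port can be made concrete with a small parser. The sketch below is an illustration only: `RmiName` is a hypothetical helper, not part of the RMI API, and it uses "localhost" as a stand-in for "the current host". It relies on java.net.URI for parsing and on the constant Registry.REGISTRY_PORT (1099).

```java
import java.net.URI;
import java.rmi.registry.Registry;

// Hypothetical helper that splits an RMI name into host, port, and object name.
class RmiName {
    final String host;
    final int port;
    final String name;

    private RmiName(String host, int port, String name) {
        this.host = host;
        this.port = port;
        this.name = name;
    }

    static RmiName parse(String url) {
        URI uri = URI.create(url);
        // Assumption: "localhost" stands in for "the current host".
        String host = (uri.getHost() != null) ? uri.getHost() : "localhost";
        // URI reports a missing port as -1; fall back to the registry default.
        int port = (uri.getPort() != -1) ? uri.getPort() : Registry.REGISTRY_PORT;
        String path = (uri.getPath() != null) ? uri.getPath() : "";
        String name = path.startsWith("/") ? path.substring(1) : path;
        return new RmiName(host, port, name);
    }

    public static void main(String[] args) {
        RmiName n = parse("//example.org/RMIServer");
        System.out.println(n.host + " " + n.port + " " + n.name);
    }
}
```

The scheme-less form `//host/name` parsed in `main` is the one built by the lookup call in actionPerformed above.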

With the remote object at hand we are able to call the method kmeans 
which performs the cluster analysis on the server. It looks like any ordinary 
method call, but it really goes through a stub (see figure 3). 

Figure 3: Stubs and Skeletons



8 Stubs and Skeletons 

A stub is a client-side proxy that implements the remote methods of a remote 
object. The stub is responsible for packaging (serializing) the arguments of 
a remote method invocation and passing control to the server. A skeleton is
the corresponding server-side proxy that accepts a method invocation from 
a client, unpacks any arguments and dispatches the invocation to the target 
method on the server. 
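The division of labour between stub and skeleton can be imitated inside a single virtual machine. The following sketch is a hand-written illustration, not the code that rmic generates: the interface `Adder`, the classes `Stub` and `Skeleton`, and the byte-array "transport" are all hypothetical, and a real RMI stub would send the packed request over a TCP connection instead of calling the skeleton directly.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.function.UnaryOperator;

class StubSkeletonDemo {
    // Hypothetical service interface standing in for a remote interface.
    interface Adder {
        int add(int a, int b);
    }

    // The actual implementation living on the "server".
    static class AdderImpl implements Adder {
        public int add(int a, int b) {
            return a + b;
        }
    }

    static byte[] pack(Object value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static Object unpack(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Skeleton: unpacks the request, dispatches to the target, packs the result.
    static class Skeleton {
        private final Adder target;

        Skeleton(Adder target) {
            this.target = target;
        }

        byte[] dispatch(byte[] request) {
            Object[] call = (Object[]) unpack(request);
            if (!"add".equals(call[0])) {
                throw new IllegalArgumentException("unknown method: " + call[0]);
            }
            int result = target.add((Integer) call[1], (Integer) call[2]);
            return pack(result);
        }
    }

    // Stub: looks like an Adder to the client, but only packs and forwards.
    static class Stub implements Adder {
        private final UnaryOperator<byte[]> transport;

        Stub(UnaryOperator<byte[]> transport) {
            this.transport = transport;
        }

        public int add(int a, int b) {
            byte[] reply = transport.apply(pack(new Object[] {"add", a, b}));
            return (Integer) unpack(reply);
        }
    }

    // Wires a stub to a skeleton; here the "network" is a direct method call.
    static Adder connect() {
        Skeleton skeleton = new Skeleton(new AdderImpl());
        return new Stub(skeleton::dispatch);
    }

    public static void main(String[] args) {
        System.out.println(connect().add(2, 3));
    }
}
```

The client code in `main` sees only the `Adder` interface, which mirrors the way the applet above works with RMIServerInterface.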

Stub and skeleton class files are created by the rmic compiler for the server
object implementing the java.rmi.Remote interface. In our example the
command

rmic RMIServer 

produces the files RMIServer_Stub.class and RMIServer_Skel.class. Both files
must be available to the virtual machine that is exporting the remote object.
The clients of a remote object need access only to the stub class file. You
can copy the stub file to the client or make it available from a URL.



9 RMI Registry 

The rmiregistry command creates a remote object registry on a specific port. 
On a Windows box we do this by: 

start /min rmiregistry 

It is possible to specify a port number. If the port number is omitted, it 
defaults to 1099. The remote object registry is a bootstrap naming service 
which is used by RMI servers. 
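As an alternative to the rmiregistry command, a registry can also be created from within a Java program via java.rmi.registry.LocateRegistry. The sketch below is not the setup used in this paper; it is a self-contained illustration that creates a registry on an anonymous port, binds a remote object, looks it up, and calls it, all in one virtual machine. The interface `Echo`, the class `RegistryDemo`, and the binding name are hypothetical, and the example assumes that the loopback address 127.0.0.1 is usable and that the Java version in use generates stubs dynamically.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;

class RegistryDemo {
    // Hypothetical remote interface used only for this illustration.
    public interface Echo extends Remote {
        String echo(String text) throws RemoteException;
    }

    static class EchoImpl implements Echo {
        public String echo(String text) {
            return "echo: " + text;
        }
    }

    // Binds an Echo object in a fresh registry, looks it up, and calls it.
    static String roundTrip(String text) {
        // Assumption: force the loopback address so the stub is reachable locally.
        System.setProperty("java.rmi.server.hostname", "127.0.0.1");
        try {
            // Port 0 lets the system choose a free port for the registry.
            Registry registry = LocateRegistry.createRegistry(0);
            EchoImpl impl = new EchoImpl();
            Echo stub = (Echo) UnicastRemoteObject.exportObject(impl, 0);
            registry.rebind("Echo", stub);
            try {
                Echo remote = (Echo) registry.lookup("Echo");
                return remote.echo(text);
            } finally {
                // Unexport both objects so that the virtual machine can exit.
                UnicastRemoteObject.unexportObject(impl, true);
                UnicastRemoteObject.unexportObject(registry, true);
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("hello"));
    }
}
```

A separately started rmiregistry process, as described above, has the advantage that it keeps running when the server program is restarted.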



10 Starting the Server 

Finally, the server program itself must be started: 

start java RMIServer 

Now the server is ready to answer client requests like the one displayed in 
figure 2. 



11 Summary 

The RMIServer class creates and exports a remote object. The Client class 
looks up the remote object in a registry and calls the kmeans method defined 
for the remote object. RMIServer defines the implementation for the remote 
object. A ClusterAnalysis object is created on the server and a copy is sent 
back to the client. The communication dynamics between client and server 
are shown in figure 4. 

Because this paper is all about RMI and not about good programming prac- 
tices, we show how to compile and run the application from a single directory. 
In practice, you would probably want to keep your source code outside your 
Web server’s directory tree. 
Figure 4: Communication between client applet and server



12 Conclusion 

Java simplifies the task of client/server programming. RMI is the right 
technique for Java-to-Java interprocess communication. It enables a method 
of an object in one virtual machine to call a method of an object in another 
virtual machine with the same syntax and ease as a local method invocation. 
Applications like remote data analysis are good examples of Java’s potential. 

Abstract: The number of information systems in the world wide web is growing
continuously. However, the development process of web information systems is 
not yet sufficiently supported by adequate methods during all phases. Especially 
methods for cost estimation are missing. Thus, the development of web informa- 
tion systems does not only bear the risk of unforeseen high implementation cost, 
but also of uncontrollable maintenance cost. In this paper we present measures for 
world wide web information systems based on a conceptual model. Existing cost 
estimation methods in software engineering are transferred to the development of 
web information systems. Furthermore, the computation of the size of an infor- 
mation system allows its classification and helps to find similar web information 
systems as reference. 



1 Introduction 

Information systems in the world wide web are in widespread use, and the
supply of information through this new medium is growing continuously.
However, the development of web information systems is not yet sufficiently
supported by appropriate methods during all phases of the development
process. Implementation and maintenance are supported by multiple tools such
as extended web page editors. For the design of web information
systems, modeling concepts (Garzotto et al. (1995), Lenz, Oberweis (1998))
similar to those for the modeling of database systems have been proposed.
For the planning phase, methods such as those known from software engineering
(Sommerville (1992)) are still missing. For example, the lack of cost
estimation concepts makes it necessary to calculate development and
implementation cost based on intuition or experiences of past projects. Analytical cost
estimation concepts require that the relevant parameters of influence for the 
development of web information systems are known and can be quantified 
with suitable measures. The metrics developed for that purpose can also be 
used to distinguish between web information systems due to different rele- 
vant characteristics. The information supply differs for example remarkably 
in size, in implementation and maintenance time and cost. Metrics also al- 
low the positioning of web information systems in a multidimensional space 
which may serve as basis for statistical analysis such as cluster analysis and 
for classification. Additionally, the semantic distance between two web
information systems can be determined. This makes it possible, e.g., to find a
suitable past project as reference for a future development project.

In this paper, we introduce metrics for the development of world wide web
information systems based on a conceptual model of the information sys- 
tem. Therefore, we first describe the underlying conceptual model for web 
information systems, the so-called page link model. In the next section, ex- 
isting software metrics and cost estimation methods for software engineering 
are briefly surveyed. In section 4, we develop a metric for the size of web 
information systems and discuss relevant influence factors on the estimation 
of implementation and maintenance cost. Finally, a brief outlook on future 
work is given. 



2 Page Link Model 

The page link model supports the modeling of structural aspects of web 
information systems by a graphical representation. 

Page types stand for classes of web pages with identical structure and com- 
parable content. They are represented by rectangles in the page link scheme. 
If all web pages of one page type are combined to a single page, we call this 
page a list page. The list page may be grouped by a grouping criterion which
then is indicated in the page link scheme. Page types have attributes in 
order to describe the content of web pages of this page type. 

In analogy to page types, links with comparable anchor, target and purpose 
are combined to a link type. Link types are represented by arrows between 
page types. We distinguish between links between two pages (unidirectional, 
bidirectional link) and a whole structure of links between several pages (in- 
dex link, guided tour link, and their combination, the index guided tour 
link). The graphical representation of the most important components of 
the page link model is shown in Figure 1. 











Figure 1: Components of the page link model (page type, page type with list page and grouping criteria, unidirectional link, bidirectional link, index link, guided tour link)

The page link scheme for a web information system can be derived from an 
extended Entity Relationship scheme (Silberschatz et al. (1997)). Figure 2 
shows a very simple example for a supplier of products like books, software 
and accessories. A book can describe or belong to several software products. 
On the other hand, software can be described or belong to several books. 
Figure 2 (a) shows the extended Entity Relationship scheme for these as- 
pects whereas Figure 2 (b) presents the derived page link scheme. A list 
page is created for all products and the products are grouped by their type 
(book, software or accessory). On this list page, each product should have
its link to a specific web page with a detailed product description, the price 
etc. To simplify the example, we have omitted the attributes of the page 
types here. 




Figure 2: Page link scheme for a supplier: (a) extended Entity Relationship scheme, (b) derived page link scheme

A detailed description of the page link model and the derivation process of 
a page link scheme from an extended Entity Relationship scheme can be 
found in (Lenz, Oberweis (1998)). 



3 Software Metrics and Cost Estimation 

Software metrics are needed to quantify relevant characteristics of the soft- 
ware and of the software development process. Generally, characteristics 
can be measured in basic units such as ’lines of code’ (LOC) or ’function
points’ for software code. Obviously, the counting of lines of code is not
very suitable for the development of web information systems for almost the
same reasons for which it has been criticized in the area of general software
development (Humphrey (1995)). A widely accepted method for the estimation
of the size of a software system is the function point method (Garmus,
Herron (1996)). The basic unit ’function point’ describes important software
functions such as input or output.

Metrics in basic units can be weighted and combined into more complex
formulas. This allows the definition of so-called derived metrics in order to
calculate a measure like the size of a software system or productivity. If 
software metrics are used during the planning phase of the development 
process, the basic units cannot be quantified at this time. Therefore, one 
has to rely on existing measures which have been validated by already fin- 
ished projects and which serve as basis for the estimation of the basic units. 
In order to obtain a relatively high degree of expressiveness the estimation 
must be as simple as possible to compute and at the same time as precise 
as possible. The number of influencing parameters has to be restricted, but
without diminishing the quality of the measure by too much simplification. 
Traceability of the estimation can facilitate error detection, refinement or 
the adaptation to changing externalities. 



4 Metrics for Web Information Systems 

The estimation of the size of a software system and its implementation cost 
can be done by the function point method. The important functions of the 
system are counted and weighted, and thus the development cost can be 
calculated with formulas that are to be customized to the specific project 
requirements. This concept can be transferred to the development of
information systems, where so-called ’web points’ have to be counted. Then
corresponding weights are to be found, and finally formulas for the calcu- 
lation of size and cost have to be generated. Until now, descriptions of 
the size of information systems are often restricted to the number of web 
pages or storage space. However, the size and maintenance cost of a web 
information system also depend on the number of links. For example, more 
maintenance work has to be done for an information system consisting of 
only three pages with many links to pages of other information systems than 
for an information system of about twenty pages and only a few corresponding
links. Consequently, ’web points’ have to be counted for both web pages
and links. For that purpose, we use the page link model. 



4.1 Size of a Web Information System 

First, we consider the size of a web information system. To know the size 
of the information system gives us a first idea of how expensive and time 
intensive the development process will be. It helps to compare a specific 
development project to other projects and thus to find reference projects. 
Let n (n > 1) be the number of different page types of the page link scheme.
Then for each page type i (i = 1, ..., n), $p_i$ denotes the estimated average
number of pages at one time. The number of attributes of that page type is
$a_i$. A very simple measure for the size of the page type i is

$$p_i \times a_i.$$

Now we have to look closer at the link types: the Kronecker symbol $\delta_{ij}$
indicates whether there exists a link from page type i to page type j ($\delta_{ij} = 1$)
or not ($\delta_{ij} = 0$). $l_{ij}$ denotes the estimated average number of links from one
page of the page type i to a page of page type j. Let $\omega_{ij}$ denote a weight.
The weight has to be chosen adequately depending on the link type. For
example, different weights for context links, structural links for navigational
help, links without local administration possibility and local links for the
navigation on the web page itself are possible. For simplicity, we suppose
$\omega_{ij} = \omega_{ji}$. In Figure 3, a small excerpt from the page link model of
our example is shown together with the relevant variables.





[Figure: page link scheme detail with the page types "supplier" and "product".
Labels: page type $i$, page type $j$; $p_i$, $p_j$ estimated average number of
pages; $a_i$, $a_j$ number of attributes of the page type; $l_{ij}$ estimated
average number of links; unidirectional link with weight
$\omega_{ij} = \omega_{ji}$, $\delta_{ij} = 1$, $\delta_{ji} = 0$.]

Figure 3: Detail of a page link scheme together with variables for size
estimation



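We then compute the weighted number of links, written $\tilde{l}_{ij}$ here.
Its form depends on the link type (unidirectional link, index link with
integrated list page, index link with extra list page, guided tour link,
index guided tour link):

$$\tilde{l}_{ij} \in \bigl\{\, \ldots,\ \omega_{ij}(1 + l_{ij}),\ 1 + \omega_{ij}(1 + l_{ij}),\ 2\,\omega_{ij}(1 + l_{ij}),\ \ldots \,\bigr\}$$

This is the number of links weighted by the respective link type. With this,
we can define a measure $S$ for the size of a web information system as

$$S = \sum_{i=1}^{n} p_i \Bigl( a_i + \sum_{j=1}^{n} \delta_{ij}\, \tilde{l}_{ij} \Bigr).$$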

This estimation of the size of a page type can be refined further. For
example, each attribute could be weighted according to the underlying
domain of the attribute (text, image, video, etc.). Additionally, the average
size of the attribute values could be estimated. These measures must be
normalized in order to make them comparable before they are integrated
into the formula for size estimation. Finally, development projects must
show whether the increase in estimation precision gained by adding another
weight or other measures justifies the additional effort.
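
As a rough illustration of the size measure S, the following Python sketch evaluates the double sum for a small, made-up page link scheme. The function name, the two page types and all numbers are assumptions chosen for illustration only; the weighted link counts are passed in directly rather than derived from the link type.

```python
from typing import Sequence


def syntactic_size(
    pages: Sequence[float],
    attributes: Sequence[float],
    delta: Sequence[Sequence[int]],
    weighted_links: Sequence[Sequence[float]],
) -> float:
    """Evaluate S = sum_i p_i * (a_i + sum_j delta_ij * weighted_links_ij)."""
    n = len(pages)
    size = 0.0
    for i in range(n):
        # Weighted links leaving page type i; delta_ij switches a link type on or off.
        link_term = sum(delta[i][j] * weighted_links[i][j] for j in range(n))
        size += pages[i] * (attributes[i] + link_term)
    return size


# Hypothetical scheme with two page types: 0 = supplier, 1 = product.
pages = [10, 200]                           # p_i: estimated average number of pages
attributes = [5, 8]                         # a_i: number of attributes
delta = [[0, 1], [0, 0]]                    # only supplier -> product is linked
weighted_links = [[0.0, 3.0], [0.0, 0.0]]   # weighted link counts (assumed values)

example_size = syntactic_size(pages, attributes, delta, weighted_links)
```

In this example the supplier pages contribute 10 * (5 + 3) and the product pages 200 * 8. Refinements such as attribute weights would enter as additional factors inside the bracket.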

In contrast to the syntactical size S, which is a measure for the presentation
of information, the semantical size of a web information system measures the
extent and relevance of the information itself. In order to find information
systems similar to a specific web information system, a comparison should
be based on both syntactical and semantical size. But as a first step, the
syntactical size of a web information system helps to categorize its
implementation and maintenance cost for the development process.



4.2 Implementation Cost 

As a result of pre-Delphi research, we identified three cost categories that
occur in the prelaunch phase of web information systems. The costs caused by
the technical requirements are related to the hardware and the
implementation environment. Hardware-related costs comprise (among others)
internet access, routers, servers, backup systems, proxy and firewall systems.
Computers, operating systems, development software and the training needed
for the implementation of the web information system cause costs related to
the implementation environment. The third category concerns the content
of the information system. Content-related costs are the costs for the
concept of the media content, the media planning, text, pictures, the layout
and the programming.



4.3 Maintenance Cost 

The prediction of maintenance cost is more difficult than the prediction of the
implementation cost, because maintenance time-cycles depend on external
influences. On the other hand, over the lifetime of a system the cost of
maintenance exceeds the implementation cost many times over (Sommerville
(1992)). The method used to identify the implementation cost can also be used
to identify the maintenance cost. As a result of the pre-Delphi research, we
identified two categories: first, the technical basis, with the cost for
connections, internet fees, and site administration, and second, the
information contents, with the cost for updates, online marketing, mailing
lists and response services. Maintenance can occur periodically, permanently
or only a few times (for example once). It can be either predictable or
unpredictable; in the latter case maintenance work is done spontaneously.
Maintenance expenditures have four causes: the correction of faults, the
adjustment of the site, the extension of the functionality and/or the
improvement of performance. A metric has to cover the objects of maintenance
as well as the probability that one of these four causes occurs. Referring to
the objects of a site, the complexity of maintenance can be described by
calculating the expenditure for the basic operations ’insert’, ’delete’ and
’update’.

5 Outlook

A Delphi study is currently being conducted to find a complete list of cost
factors that determine the cost of development and maintenance. The Delphi
technique has gained considerable acceptance inside and outside MIS research
for forecasting problems (Niedermann et al. (1991), Saunders, Jones (1992),
Malhotra et al. (1993) (Robeson 88)). Any Delphi study comprises several
rounds of opinion gathering from an expert panel. In each round, members of
the panel are asked to react in writing to a shared document that summarizes
the evolving consensus, as well as current positions and arguments of all
members of the panel (Martino (1983)).

Besides the identification of a complete list of cost determinants, the Delphi
method allows the factors to be ranked and validated. Metrics for the
implementation and maintenance cost can be derived by assigning those
determinants to the page link scheme. Weights that have to be found for these
metrics (e.g. for the different types of maintenance) will be validated
against completed projects.

Furthermore, a database storing information about the development process
will be created for completed web information system projects. Based on the
metrics for size, implementation and maintenance cost, a metric for the
semantic distance of two web information systems is to be developed. This
metric will make it possible to identify past projects in the database that
can serve as references for a future web information system project.

Abstract: In order to completely support business processes using a workflow
management system, all the knowledge inherent in these processes has to be
captured by the workflow modeling language. To meet the requirements for
flexibility and expressiveness we developed the workflow language EventFlow_L,
based on an event model, and we integrated the Unified Modeling Language
(UML) for data modeling and the Object Constraint Language (OCL) as a
connecting element. The events generated during execution of a workflow can
be analyzed to learn from completed processes.



1 Introduction 

In the business environment there is a strong connection between knowledge 
and business processes. Business processes contain knowledge about work 
and business goals. Of special importance is so-called tacit knowledge which 
is necessary for really getting work done (Sachs (1995)). 

In order to support and partially automate business processes many orga- 
nizations have applied workflow management systems (WFMS) in recent 
years. Before such a system can be used the business processes concerned 
have to be modeled using a workflow modeling language. This language 
needs enough expressive power and flexibility to capture all relevant knowl- 
edge which is associated with the processes being modeled. Unfortunately 
this requirement is not met by most of the currently available workflow sys- 
tems (Vossen, Becker (1996), Wargitsch et al. (1997), Abbot, Sarin (1994)). 
Therefore we decided to develop the graphical workflow modeling language
EventFlow_L, which offers improved support in the following areas:

• During the modeling phase the concept of stepwise refinement can 
be used offering a prototyping-like development which leads to fast 
feedback from the workflow users, so that modeling mistakes can be 
found at an early stage. 

• The use of an event model allows flexible exception handling and the 
coupling of groupware systems with the WFMS. 

• Another focus lies on the flexibility of running workflows: changes can
be applied just to the current workflow instance (ad-hoc change) or to all
instances of the type in question (dynamic change). This is an important
feature to capture the dynamic nature of business processes (Ellis et al.
(1995)).

• Time constraints can be attached to workflow descriptions so that the 
violation of deadlines can be controlled. 

• For data modeling we integrated the object-oriented Unified Modeling
Language (UML) (Rational (1997/1)) and its companion, the Object
Constraint Language (OCL) (Rational (1997/2)). UML/OCL is not only used
for the description of the workflow data, but also for the meta-model of
EventFlow_L. The meta-model as well as workflow data can be referenced in
the process model via OCL, which is used in EventFlow_L for the formulation
of conditions and constraints.

Aside from that strong connection to the UML, we did not integrate
further elements of UML into our concepts. Activity diagrams in particular
are considered a good candidate for business process modeling. However,
there is a difference between business process modeling and workflow
modeling. Workflow descriptions are on a more detailed and perhaps also
on a more technical level. Furthermore, activity diagrams have no means to
express many of the key concepts of EventFlow_L: there is no notion of time
or time constraints, it is not possible to describe unstructured processes,
and the use of OCL in activity diagrams is not as stringent as in
EventFlow_L, where stringency is necessary to make the workflow descriptions
executable on a WFMS.

Nevertheless, activity diagrams (like several other business process
modeling diagrams, e.g. EPCs (Scheer (1995))) may complement EventFlow_L
on the level of business process modeling, where one has a need for less
formal description methods. These models may then be used as a basis for
the workflow model.



2 EventFlow_L

2.1 Basic Elements 

In this section we will introduce the basic elements of EventFlow_L, which can
be found in other workflow languages in some variations (WfMC (1996)),
making them readily comprehensible. Throughout this section the warehouse
order processing workflow in Fig. 1 will demonstrate the use of the
EventFlow_L elements. To keep the example simple, we omitted the data
model that defines entities like “Order” and “Product”, which are referenced
in the workflow.

A workflow in EventFlow_L consists of worksteps graphically represented as
rectangles. Each workstep has a name and a unique identifying number.
The causal dependencies between these steps are depicted by arrows.

Figure 1: The warehouse order processing workflow



A process can be split into several branches by using a fork element. There
are three different forks available: the XOR-fork splits a process into
alternative branches (i.e. only one branch will actually be executed), whereas
the AND-fork introduces branches which are executed in parallel. The OR-fork
finally permits one to execute m out of n alternative branches. Fig. 1
contains an XOR-fork after workstep (1) and an AND-fork before worksteps
(3) and (4). OR-forks and XOR-forks which have a condition attached to
them are called controlled forks, because the WFMS can automatically
detect which paths have to be executed. At an uncontrolled OR- or XOR-fork
this decision is left to the workflow user. Corresponding elements exist to
converge several branches into a single one. For example, after worksteps (3)
and (4) an AND-join synchronizes the parallel execution of these steps, and
two alternative paths are converged by an XOR-join just before the “End”
mark. The OR-fork and the OR-join (not shown in Fig. 1) are depicted by
a circle containing the logical OR-symbol (∨).

Due to space restrictions we cannot show and describe the elements for data 
flow, roles and resources. 

2.2 Event Model 

The event model is the basic concept in EventFlow_L to describe the execution
semantics. Furthermore, it serves as a coupling element between workflows
and between workflows and external systems (e.g. groupware applications).
Each important point in an EventFlow_L workflow is annotated with a
so-called mark. Each fork and each join element has a mark, and the beginning
and end of every workstep have one. When such a mark is reached during
execution, an event is generated.

On the one hand these events are used to control the execution. An AND- 
join for example waits until all directly connected marks on incoming paths 
have emitted an event, whereas an OR-join proceeds (and generates its 
event) as soon as one directly connected mark on an incoming path has 
emitted its event. 
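
The join behaviour described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the class `Join`, its fields and the mark names such as `end(3)` are hypothetical and do not reproduce the actual EventFlow_L engine.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Join:
    """A join element that reacts to events from marks on its incoming paths."""
    kind: str                  # "AND" or "OR"
    incoming: Set[str]         # names of the directly connected marks
    seen: Set[str] = field(default_factory=set)
    fired: bool = False

    def on_event(self, mark: str) -> bool:
        """Record an event; return True if the join emits its own event now."""
        if self.fired or mark not in self.incoming:
            return False
        self.seen.add(mark)
        if self.kind == "AND":
            # An AND-join waits until every connected mark has emitted an event.
            ready = self.seen == self.incoming
        else:
            # An OR-join proceeds as soon as one connected mark has emitted an event.
            ready = True
        self.fired = ready
        return ready


and_join = Join("AND", {"end(3)", "end(4)"})
first = and_join.on_event("end(3)")    # still waiting for end(4)
second = and_join.on_event("end(4)")   # both marks seen, join fires

or_join = Join("OR", {"end(A)", "end(B)"})
or_first = or_join.on_event("end(B)")  # fires on the first event
```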

On the other hand events serve as a connecting mechanism between other
applications and a workflow. If the orders in our warehouse example arrive 
by e-mail and the e-mail application generates an event whenever this hap- 
pens, then the execution of the order processing workflow can be triggered 
by this event. Fig. 2 shows how this is modeled. 



[Figure: event "e-mail arrival", mark "Begin", workstep (1) "receive order"]

Figure 2: An event triggers the beginning of a workflow



Besides being able to start a workflow, events can also be used to influence 
a workflow at other points. If for example the “deliver” -workstep can only 
be started when a truck arrives at the warehouse, then this situation can 
be described as shown in Fig. 3 (assuming there is an application which 
generates an event of type “truck arrival” whenever a truck arrives). 



[Figure: event labelled "truck"]

Figure 3: An event influences a workflow



The events which are generated during the execution of a workflow form a
kind of trace of what really happened, i.e. they contain the information
about the paths through the workflow which have actually been taken, about
exceptions that occurred and about ad-hoc changes. Since the events are
time-related data (every event gets a timestamp at the moment of its
creation), we can query the resulting database using the Graphical Temporal
Language (GTL) (Oberweis, Sanger (1994)). Another way to use this data is to
add it to a data warehouse and associate it with other business data. If the
data warehouse is used in the form of a corporate memory (Erdmann (1997)),
valuable knowledge can be extracted from it. The knowledge which is drawn
from the event trace of a workflow can be used to improve current business
processes.
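
To make the idea of an event trace concrete, the following Python sketch reconstructs the path of one workflow instance and measures the time between two marks. It uses plain Python rather than GTL; the `Event` record, the mark names and the timestamps are assumptions made up for this example.

```python
from datetime import datetime
from typing import List, NamedTuple


class Event(NamedTuple):
    mark: str            # mark that generated the event
    instance: int        # workflow instance the event belongs to
    timestamp: datetime  # set at the moment the event is created


def path_taken(trace: List[Event], instance: int) -> List[str]:
    """Return the marks reached by one workflow instance in time order."""
    own = [e for e in trace if e.instance == instance]
    return [e.mark for e in sorted(own, key=lambda e: e.timestamp)]


def elapsed_hours(trace: List[Event], instance: int,
                  start_mark: str, end_mark: str) -> float:
    """Hours between two marks of one instance (each mark assumed to occur once)."""
    times = {e.mark: e.timestamp for e in trace if e.instance == instance}
    return (times[end_mark] - times[start_mark]).total_seconds() / 3600.0


# Hypothetical trace containing events of two workflow instances.
trace = [
    Event("begin(1)", 1, datetime(1998, 3, 2, 9, 0)),
    Event("end(1)", 1, datetime(1998, 3, 2, 9, 30)),
    Event("begin(1)", 2, datetime(1998, 3, 2, 10, 0)),
    Event("begin(2)", 1, datetime(1998, 3, 2, 11, 0)),
    Event("end(2)", 1, datetime(1998, 3, 2, 15, 0)),
]

path = path_taken(trace, 1)
hours = elapsed_hours(trace, 1, "begin(1)", "end(2)")
```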

2.3 Hierarchies in Workflow Descriptions

As mentioned above, worksteps can be hierarchically refined in EventFlow_L.
Each workstep can be a workflow on a more detailed level and each workflow 
can be seen as a workstep on a higher level. The marks at the beginning and 
end of a workstep are identical to the marks at the beginning and end of its 
refined workflow; they serve as a connecting element between the different 
levels of the hierarchy. 

Workstep (6) is an example of a refined workstep. This is indicated by the
shadow beneath its rectangle. Fig. 4 shows its refinement (the workflow will 
be explained in section 2.6). 




Figure 4: Refinement of workstep pay (6)



2.4 Pre-, Start-, Post- and Endconditions 

Several conditions (formulated in OCL) can be attached to a workstep to 
specify the circumstances of its execution in more detail. 

Pre- and postconditions serve as a guard to prevent the workflow from con- 
tinuing in the case of an erroneous situation. The precondition is checked 
before the workstep begins and the postcondition after its completion. If 
one of these conditions does not evaluate to “true”, an exception event is 
generated and the workflow stops. In such a case a workflow can be started 
which handles the exception. There is no need for special modeling elements 
to describe an exception workflow; it is an ordinary workflow whose start is 
triggered by the exception event. 

Start- and endconditions are complementary to pre- and postconditions. A 
startcondition is evaluated before the respective workstep begins. If the 
condition is met, then the workstep is allowed to start. If it fails, then the 
system waits and constantly reevaluates the condition until it is met. An 
endcondition has the same functionality but at the end of a workstep; it 
controls whether a workstep is allowed to terminate. 

To make the distinction between pre- and postconditions on the one hand and
start- and endconditions on the other hand clearer: pre- and postconditions
stop a workflow if “too much happened”, which led to a faulty situation,
whereas start- and endconditions wait if “not enough happened” or if there
is “something left to be done” before the workflow is able to proceed. In a
concrete case this distinction may not always be clear, and the workflow
modeler can make the decision dependent on the desired reaction of the
workflow system (wait or stop).
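
The difference between stopping and waiting can be sketched as follows. In EventFlow_L the conditions are formulated in OCL; here ordinary Python predicates stand in for them. The names `try_start` and `ExceptionEvent`, the state keys and the order in which the two conditions are checked are assumptions for illustration.

```python
from typing import Callable, Dict

State = Dict[str, object]


class ExceptionEvent(Exception):
    """Stands in for the exception event generated by a failed precondition."""


def try_start(state: State,
              precondition: Callable[[State], bool],
              startcondition: Callable[[State], bool]) -> str:
    """Return "started" or "waiting"; raise ExceptionEvent if the precondition fails."""
    if not precondition(state):
        # "Too much happened": the workflow stops and an exception event is raised.
        raise ExceptionEvent("precondition violated")
    if not startcondition(state):
        # "Not enough happened": the system waits and re-evaluates later.
        return "waiting"
    return "started"


# Hypothetical conditions for the "deliver" workstep of the warehouse example.
pre = lambda s: not s["order_cancelled"]
start = lambda s: bool(s["truck_arrived"])

status_before = try_start({"order_cancelled": False, "truck_arrived": False}, pre, start)
status_after = try_start({"order_cancelled": False, "truck_arrived": True}, pre, start)

try:
    try_start({"order_cancelled": True, "truck_arrived": True}, pre, start)
    stopped = False
except ExceptionEvent:
    stopped = True
```

An exception workflow would then simply be an ordinary workflow triggered by the exception event, as described in the text.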

In a workflow description pre- and startconditions are surrounded by a
rectangle which is attached to the lower left corner of the workstep rectangle.
The same holds for the post- and endconditions; their rectangle is attached
to the lower right corner of the workstep rectangle. The precondition is
introduced by the label “PRE”, the startcondition by “START” and the post-
and endconditions by “POST” and “END” respectively.



2.5 Change of Running Workflows 

Change is an inherent element in real-life business processes (Ellis et al.
(1995)). Therefore a WFMS has to consider how change can be supported.
Change can occur on a per business case basis, i.e. a business process
deviates from its normal execution in a specific case and manner; this form
of change is called ad-hoc change.

Offering adequate flexibility in the field of ad-hoc change means that one has
to find a middle course between too much freedom (allowing everything) and
too much rigidity. The aim is to be able to adapt to unforeseen situations
but at the same time not to bring the workflow into an erroneous state or
into a situation where it cannot proceed. Therefore we offer the workflow
user a set of change primitives which can be applied to a workflow or to
worksteps safely. They allow the user, for example, to skip or repeat a
workstep (operations skip, repeat) or to delegate the execution of a (manual)
workstep to another person (operation delegate).

Another form of change - dynamic change - takes place on the type level. 
This is in contrast to ad-hoc change which affects only a single instance of 
a workflow. The problem with dynamic change is how to migrate running 
workflow instances to the new workflow type. The most powerful strategy is
dynamic migration (Bichler et al. (1997)), which allows workflow instances
“not ready for migration to continue according to the old workflow until they
qualify for migration”. During the migration process a migration workflow
is often needed which transforms the state of the old type to a state which
is compliant with the new workflow type.

Dynamic change in EventFlow_L is accomplished by the migrate operation,
which takes as arguments the migration workflow, the old and the new
workflow type as well as the mark where the old workflow is left and the mark
where the new workflow is entered.

2.6 Temporal Constraints

From time to time a process reaches a point where it has to wait until some 
events happen or some actions are taken which are both outside the world 
of the workflow and outside the WFMS. To avoid blocking the process for 
an unreasonable period of time, a time limit can be used which is a kind of 
a fork. 

Fig. 4 shows an example of a time constraint. The workflow waits until the
payment is made, but the waiting period is limited to 20 days. After this
period a reminder is sent to the customer, and after waiting for another
20 days the case is passed to a collection agency. The example also shows how
elements of the meta-model can be referenced
(Process(6.2).end.event.occurred).
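
The time limit in the payment example behaves like a fork whose branch depends on elapsed time. The following Python sketch mirrors that logic; the function name, the return labels and the treatment of the exact boundary days are assumptions, not part of EventFlow_L.

```python
def payment_step(days_waited: int, payment_received: bool,
                 limit_days: int = 20) -> str:
    """Decide which branch of the time-limited wait applies.

    Assumes the reminder goes out exactly when the first limit expires and
    the case is handed over once a second period of the same length has passed.
    """
    if payment_received:
        return "continue"
    if days_waited < limit_days:
        return "wait"
    if days_waited < 2 * limit_days:
        return "reminder sent"
    return "pass to collection agency"


early = payment_step(5, False)
reminded = payment_step(20, False)
escalated = payment_step(40, False)
paid_late = payment_step(30, True)
```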

2.7 Weakly Structured Workflows 

In weakly structured workflows not all causal dependencies between
worksteps are fixed in advance. In this case we need a more dynamic approach,
where these relationships can be expressed depending on the current state
of the workflow.

This can be done in EventFlow_L using workstep-groups and startconditions.
A workstep-group is a workstep containing other worksteps whose causal
dependencies are not fixed but rather are controlled by their startconditions.
Fig. 5 shows an example with three worksteps (A, B and C). These steps
can start in an arbitrary order and they can be executed in parallel. The
only constraint is that if two of the worksteps are running, the third one is
not allowed to start.
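
The constraint of this example can be expressed as one startcondition shared by all three worksteps. The sketch below uses a Python predicate in place of an OCL expression; the function `may_start` and the representation of the running steps as a set are assumptions for illustration.

```python
from typing import Set


def may_start(step: str, running: Set[str], limit: int = 2) -> bool:
    """Startcondition: a step may start only while fewer than `limit` other steps run."""
    others = running - {step}
    return step not in running and len(others) < limit


running: Set[str] = set()
a_ok = may_start("A", running)          # nothing is running yet
running.add("A")
b_ok = may_start("B", running)          # only A is running
running.add("B")
c_blocked = may_start("C", running)     # A and B are running, C has to wait
running.discard("A")
c_ok = may_start("C", running)          # A has finished, C may start now
```

Because the condition is re-evaluated until it holds, workstep C simply waits until one of the other two steps has terminated.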




Figure 5: An example for a workstep-group 

3 Conclusions 

We have presented the graphical workflow modeling language EventFlow_L,
which has features that allow the workflow modeler to capture important
details of the business process being modeled. Among these features are
start- and endconditions, pre- and postconditions, time constraints, change
primitives to accomplish ad-hoc as well as dynamic change and concepts
to support weakly structured workflows. On the data modeling side we
integrated the object-oriented Unified Modeling Language and connected it
to the process model using the Object Constraint Language.

The event model in EventFlow_L allows one - among other things - to collect
information about the execution of workflows. This information can be
analyzed in various ways and the knowledge extracted in this way can be
used to improve existing processes.

Abstract: We argue that categorization and visualization of the explored
information space (e.g. the web) are desirable in order to improve the
re-access of information. Traditional browsers support only mechanisms such
as bookmarks and/or history lists in order to revisit and to reload documents.
This paper presents an innovative tool (CineCat) that uses the client side
cache to categorize and visualize the explored web space. CineCat supports
filtering techniques based on categories to provide simplified views of the
information space.



1 Introduction 

The bookmark and history list mechanisms have been available in most web
browsers since they were introduced by Mosaic (NCSA (1998)). Bookmarks
provide a means for users to remember visited web pages in order to
reload (recall) these pages at some later time. The history mechanism allows
users to re-inspect (revisit) previously loaded web pages within a session,
based on a stack model. It is interesting to note that many users have the
misconception that the history list refers directly to the temporal ordering
in which pages have been previously loaded (see e.g. JONES, S., COCKBURN, A.
(1996)). This paper argues that the persistent client side cache of a browser
is - among its other purposes - a powerful instrument for both revisiting
and recalling web pages.

Navigation in the web may be an arduous task because of the absence of 
a global categorization and visualization scheme for the enormous amount 
of available information. The client side cache of a browser represents the 
explored web space and is a good starting point for a categorization and 
visualization scheme. 

This paper presents a tool called CineCat (HULSBUSCH, T. (1997)) that 
enables the user to define categories, and to categorize and visualize the 
cached documents and the link structure between these documents. The user 
can navigate easily within the explored web space and revisit or recall web 
documents. CineCat includes several view options that can tailor efficiently 
the display of large and strongly interlinked information spaces. 

The paper is structured as follows: In Section 2, we give a short introduction
to how web caching works and to the cache hierarchy in the web. In Section 3
the functionality of the history list and bookmark mechanisms is discussed.
Section 4 describes CineCat. Finally, Section 5 offers concluding remarks
and a brief description of future work.

2 Cache Hierarchy in the Web

In the current Web, documents are identified by uniform resource locators 
(URLs) that refer to the location (host and directory) where a document 
is stored. This physical addressing mechanism requires the heavy use of 
intermediary caches to avoid the hopelessly repetitious transfer of interesting 
documents. Figure 1 shows a typical setup where documents from a distant
server (Internet) are transferred through possibly multiple proxy servers (e.g.
one at the access provider, one in a company or department).



[Figure: proxy caches between the distant server and the web browser]

Figure 1: Cache hierarchy in the web

Usually there are two cache levels on the client side (Web Browser) as well: 
The highly volatile cache (memory cache) is primarily used for the history 
mechanism, where the user can use e.g. the Back button to see exactly 
the same document he saw when the resource was retrieved the last time 
(Fielding, R. et al. (1997)). The memory cache contains static and computed 
pages. 

The browser side disk cache is a persistent cache that contains complete or
partial cacheable documents together with some meta data (URL, last
modification date, content type, current size of cache entry, time of last
retrieval, etc.). When the user requests a document, the web client checks for
an entry in the local cache and verifies the validity of the entry before the
request leaves the client machine. Typically the size of the client side disk
cache is bounded by a maximum size and cache entries are removed using an LRU
(least recently used) mechanism based on the last access time. The client side
disk cache contains the most recently explored web space. For most browsers
this important set of web pages is more or less invisible to the user.
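
The eviction policy mentioned above can be sketched with Python's `collections.OrderedDict`. This is a minimal illustration, not the cache of any particular browser: the class name `DiskCacheSketch` is hypothetical, entries are kept in memory, and validity checks and meta data are left out.

```python
from collections import OrderedDict
from typing import Optional


class DiskCacheSketch:
    """Size-bounded cache that evicts the least recently used entry first."""

    def __init__(self, max_bytes: int) -> None:
        self.max_bytes = max_bytes
        self.used = 0
        self.entries: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, url: str) -> Optional[bytes]:
        if url not in self.entries:
            return None
        self.entries.move_to_end(url)   # mark as most recently used
        return self.entries[url]

    def put(self, url: str, body: bytes) -> None:
        if url in self.entries:
            self.used -= len(self.entries.pop(url))
        self.entries[url] = body
        self.used += len(body)
        # Remove entries with the oldest access time until the bound holds again.
        while self.used > self.max_bytes and self.entries:
            _, evicted = self.entries.popitem(last=False)
            self.used -= len(evicted)


cache = DiskCacheSketch(max_bytes=10)
cache.put("a", b"12345")
cache.put("b", b"12345")
cache.get("a")                 # "a" becomes the most recently used entry
cache.put("c", b"123")         # 13 bytes in total, so "b" is evicted
remaining = list(cache.entries)
```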

One motivation of this paper was to experiment with the potential of the 
most recently explored web space and to examine how to categorize and 
to visualize this information space in order to improve navigational trans- 
parency. Visualization can help to develop a spatial understanding of the 
explored information space in order to locate information in the already-seen 
web space quickly. Filtering mechanisms based on categories can be used to 
provide different views on the information space (e.g. the information space 
relevant to one or more research topics). The local cache can assist off-line 
browsing and cooperative browsing when the cache is shared. 

3 Support for Navigation and Categorization
in Browsers

In the following we describe the functionality of the bookmark and history 
list mechanisms: 



History List 

The web pages visited during a session are kept in the linear history
list, which can be used to recall these pages at some later time. The
history list contains the shortest navigation path from the session start
to the current document. The Back and Forward buttons of a browser
can be used to navigate backward or forward in this navigation history.

Note that in general the Back and Forward buttons do not control
browsing in the temporal ordering of previously visited pages, but
determine the currently displayed page in a stack of pages (JONES,
S., COCKBURN, A. (1996)).

This stack of pages in the history list does not necessarily contain all
previously visited pages: e.g. hitting Back twice and then activating a new
link deletes two recently read pages from the stack. Entries are also
lost when the browser is terminated. The volatility of the history list
entries means that it is not feasible to provide additional information
(such as meta-data) for these entries. Therefore the history list cannot
support persistent categorization well.



Bookmarks 

Bookmarks can be used to save document locations (URLs) in order to
revisit these pages at some later time. When a bookmark is added,
the URL of the current document is placed into a file with other
bookmarks. Note that these entries are typically also in the cache.
Entries in the bookmark file can be categorized hierarchically. Each entry
belongs to one category. When an entry should be placed in more than
one category, it must be duplicated.
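
The stack behaviour of the history list described above, in which going back twice and following a new link removes two pages, can be sketched as follows. The class `HistoryStack` and the page names are hypothetical; real browsers differ in detail.

```python
from typing import List


class HistoryStack:
    """Linear history list with a pointer to the currently displayed page."""

    def __init__(self, start: str) -> None:
        self.pages: List[str] = [start]
        self.pos = 0

    def visit(self, url: str) -> None:
        """Follow a new link: entries ahead of the current position are dropped."""
        del self.pages[self.pos + 1:]
        self.pages.append(url)
        self.pos += 1

    def back(self) -> str:
        if self.pos > 0:
            self.pos -= 1
        return self.pages[self.pos]

    def forward(self) -> str:
        if self.pos < len(self.pages) - 1:
            self.pos += 1
        return self.pages[self.pos]


history = HistoryStack("A")
history.visit("B")
history.visit("C")
history.back()
history.back()          # page A is displayed again
history.visit("D")      # B and C are removed from the stack
```

After these steps the list holds only A and D, although four pages were visited. A persistent cache, in contrast, would still contain all four documents.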



In the remainder of the paper we will investigate how the functionalities
for recalling and revisiting pages can be provided through a single, more
powerful mechanism that exploits the local cache. Every local disk cache
already provides meta-data, so categorization is a conservative extension.
However, the local cache with all its links is a very complex structure which
has to be visualized to become explorable. The next section describes the
innovative tool (CineCat) that analyses and extends the client side disk
cache and gives the user the ability to categorize the contained documents
and to visualize the visited information space.

4 CineCat - Cineast Cache Analysis Tool

CineCat is implemented in the Wafe (Neumann, Nusser (1993)) environment,
which includes support for several libraries such as OSF/Motif and OTcl
(Wetherall, Lindblad (1995)). The user interface of CineCat is implemented
using OSF/Motif; its application logic is implemented using OTcl. CineCat
uses the client side cache of the Cineast web browser (Koppen et al. (1997)).
The Cineast browser, CineCat, and the Wafe software package are freely
available (WAFE (1998)). In order to visualize and categorize the cached data
and its link structure we have to deal with the following aspects:

• Which objects can be visualized and categorized? 

• How are these objects linked (kinds of links)? 

• How can these objects be visualized, i.e. how can they be presented to 
the user in a clear way? 

• Which icon set can be used to visualize the categories in a clear way? 

• How should the visited web space be graphically displayed (layout)?

• How can cache entries be categorized in a user friendly way? 

In the following we will discuss these aspects and finally we will describe the 
CineCat user interface and its functionality. 

4.1 Visualizing Documents 

A client side cache includes the visited web pages, images and other
hypermedia data like audio files. A cache also includes meta data, which is
of minor importance for this paper. In order to give the user a suitable view
of the cached documents and their link structure we only consider the web
pages (HTML documents). We distinguish between server start documents
(Hosts) and other documents (Pages) that are linked from the server start
document.



[Figure: symbols for two server start documents (Host) and one other document (Page)]

Figure 2: Symbols for server start documents and others

Figure 2 shows how we visualize these two types of documents (two server
start documents and a document). The icon depicted in the middle of each
symbol designates the category of this document. Documents that are not
yet categorized can be visualized by using predefined icons. For example,
in Figure 2 we use the icon in the middle to indicate an uncategorized server
start document. For a document (Page) we use a different icon. The user can
define his own categories and his own icons. All predefined and user-defined
categories can be assigned to any document.



4.2 Visualizing the Link Structure 

CineCat distinguishes between three kinds of links (see Figure 3): 



Figure 3: Link types



• Internal links. An internal link references a document that exists on
the same web server (displayed as a red dashed arrow).

• External links. An external link references a document that exists on
a different web server (displayed as a solid blue arrow).

• Help links. A help link connects two server start documents in order
to indicate that a document on one web server references a document on
a different web server. A help link is needed when the web space of a
web server is not completely visible in the current view but contains
documents, currently not displayed, that refer to documents on other
servers (displayed as a dotted green arrow).
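The three kinds of links can be illustrated with a short sketch. It is not taken from CineCat (whose application logic is written in OTcl); the function names are hypothetical, and the rule used for help links (an external link gets a help link whenever one of its two documents is not visible) is an assumption based on the description above.

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lower-cased host name of an absolute URL."""
    return (urlsplit(url).hostname or "").lower()


def classify_link(source_url: str, target_url: str) -> str:
    """Internal if both documents live on the same web server, else external."""
    if host_of(source_url) == host_of(target_url):
        return "internal"
    return "external"


def help_links(links, visible):
    """Derive help links (pairs of hosts) for external links hidden in the view.

    links:   iterable of (source_url, target_url) pairs found in the cache
    visible: set of document URLs currently shown in the display area
    """
    result = set()
    for source, target in links:
        if classify_link(source, target) != "external":
            continue
        # Assumption: if both documents are visible, the external link
        # itself is drawn and no help link is required.
        if source in visible and target in visible:
            continue
        result.add((host_of(source), host_of(target)))
    return result
```

With only the two server start documents visible, an external link between two hidden pages is thus summarized by one help link between their hosts.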

4.3 User Interface Structure 

The user interface of CineCat contains four major areas (see Figure 4): 

• Menu bar and tool bar. These components in the top area allow the
user to select functions depicted as icons or as pull down menus.

• Category area. This area is located at the top of the left hand side
of the user interface. All user-defined categories are presented in
this area. The user can define new categories and assign icons of his
choice to them. Categories can also be deleted.

• Cache content area. This area is located at the bottom of the left
hand side of the user interface. In this area the URLs of all server
start documents available in the cache are displayed. If an entry is
selected, the contents of the corresponding web server are displayed
hierarchically.

• Display area. This area is located on the right hand side of the user
interface. In this area a graph is displayed that represents the link
structure of the documents that belong to the selected categories and
selected web servers. The user can assign a document in the display
area to several categories by drag and drop. As a result, the icon of the
first category of this document is shown as part of the document
symbol.

Figure 4: The CineCat user interface

In Figure 4 the black entries in the “categories” and “cache contents”
areas are currently selected. If the user selects a category in the category
area, all corresponding documents and server start documents are
displayed in the display area; they disappear again when the category
is deselected. In the same way the user can select the web servers
to be visualized. This mechanism allows the user to reduce the number of
visible items in the view area and thus serves as a search mechanism that
shows the categorized documents in context.

In the view area a link can be selected to obtain, e.g., its origin and
destination. The user can select a document to view it with a simple HTML
widget.

4.4 Layout of Interlinked Documents in the View Area

The layout of the documents and the link structure is a complex problem,
since it is not trivial to present a large number of documents that are
strongly interlinked. We have integrated various view options in CineCat that
focus on aspects such as a global view or detailed host information. The user
can select these options from the menu bar (view menu) to obtain another
graph that represents the link structure of the cached documents, or of those
documents that belong to the selected categories and web servers in the
display area (see Section 4.3). In the following we describe these view
options:

• Detailed view. This option shows the internal, external and help links
(see Figure 4).

• Selection of link types. The user can select which kinds of links
(internal, external or help links, see Section 4.2) are displayed. This
option can be used only together with the detailed view option, e.g. to
visualize only the external links starting from a selected web server.

• Global overview. In order to reduce the complexity of the graph we use
small icons for this option (see Figure 5). All server start documents
are displayed without links in matrix form. All documents that
are directly referenced from a server start document are arranged in a
circle around this document. The remaining, indirectly referenced
documents appear on a larger circle around the server start document.
When the mouse pointer passes over a matrix node, all links related to this
node are displayed temporarily. This enables the user to explore
the link structure of the cached data.

Figure 5: Matrix and circular layout
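The matrix and circular placement used by the global overview can be sketched as follows. This is only an illustration of the geometry, not CineCat's layout code; the function names, the grid spacing and the radii are assumptions.

```python
import math


def matrix_positions(hosts, columns, spacing=10.0):
    """Place server start documents on a grid, filled row by row."""
    return {
        host: ((i % columns) * spacing, (i // columns) * spacing)
        for i, host in enumerate(hosts)
    }


def circle_positions(center, documents, radius):
    """Spread documents evenly on a circle around a server start document."""
    cx, cy = center
    n = len(documents)
    return {
        doc: (cx + radius * math.cos(2 * math.pi * k / n),
              cy + radius * math.sin(2 * math.pi * k / n))
        for k, doc in enumerate(documents)
    }
```

Directly referenced documents would be placed with a small radius and indirectly referenced ones with a larger radius around the same center, which yields the two concentric circles described above.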

5 Conclusion

In this paper we have described CineCat, a tool that gives the user
the ability to visualize and categorize the explored web space. This web
space is identical to the client side cache. CineCat can visualize the link
structure of the cached documents. By using the integrated view options
CineCat is able to visualize a large and strongly interlinked cache (e.g. 2000
documents).

Since a cache can be used by several users, CineCat can be extended to
support multi-user visualization and (cooperative) categorization. This
extension is useful in order to achieve cooperative browsing, where a user
can contribute URLs to a shared cache and URL base.

From the security point of view this extension poses some challenging
problems. For example, a user can have sensitive data in his local cache that
should not be added to the shared space. The one-to-one correspondence
between cache and explored web space is questionable in this case.

Another interesting extension would be to add automated classification
methods to CineCat in order to classify web documents in a
(semi-)automated way.

Abstract: Currently, most university curricula are based on independent
courses and do not take into account the existing knowledge and learning
preferences of the students. We provide a model which addresses these
requirements and provides more flexibility. The model is based on the
division of courses into self-contained concepts and user-specific
navigation structures.



1 Introduction 

The curricula of many, perhaps most, German universities are currently
under discussion. On the one hand, public universities have to face growing
competition from private universities; on the other hand, there is a strong
desire from government and industry to shorten the length of studies
and to allow higher flexibility and more specialization. As a consequence,
classes become more heterogeneous. Traditional ways of teaching are no
longer sufficient to cover the continuously growing requirements on students.
These requirements result in a number of design goals for a teaching and
learning system:



• Definition of well-defined units for knowledge transfer, which have
a much smaller granularity than a course, to make courses more flexible.

• Description of the pre- and postconditions (required knowledge,
acquired knowledge) of the knowledge units; consideration of the existing
knowledge and learning preferences of the students by presenting
tailored course material.

• Reuse of existing course material. 

• Development of tools for students and lecturers. 



The following sections describe the needed definitions and terms, the result- 
ing structures and an architecture for an implementation. The paper finishes 
with a discussion of the results and topics for further work. 

2 Definitions and Terms

First, the terms we need to describe our approach are defined: 

Concept: Smallest unit of knowledge; describes a self-contained,
independent idea; can be identified through a keyword; often requires the
knowledge of other concepts (pre-requirements)

Course Unit: Presentation which conveys one or more concepts

Presentation Unit: Physical representation of one or more concepts

Course: Sequence of course units which follow a certain theme or teaching 
goal, realized as a path through the knowledge pool 

Knowledge Pool: Contains all concepts which are used in the courses of 
a department 

Course units and courses are conflict-free, preselected navigation paths over
a set of concepts. A navigation path is conflict-free if none of the
pre-requirements of the covered concepts is violated. An example of the above
categorization would be:

Concept: Polymorphism (Pre-Requirements: Class, Object, Type) 

Course Unit: Object-Oriented Analysis 

Presentation Unit: Slides and handouts for the concept “Polymorphism”

Course: Systems Analysis and Design

Knowledge Pool: Courses of the Department of Information Systems and 
Software Techniques 
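The definition of a conflict-free navigation path given above can be checked mechanically. The following is a minimal sketch, assuming that pre-requirements are stored as a mapping from each concept to the concepts it requires; the function name and the data layout are hypothetical and not part of the described system.

```python
def is_conflict_free(path, prerequisites, known=()):
    """Return True if every concept's pre-requirements are covered before it.

    path:          concepts in the order in which they are presented
    prerequisites: dict mapping a concept to the concepts it requires
    known:         concepts the student knows before the path starts
    """
    covered = set(known)
    for concept in path:
        if not set(prerequisites.get(concept, ())) <= covered:
            return False  # a pre-requirement has not been covered yet
        covered.add(concept)
    return True
```

For the example above, a path that presents “Polymorphism” before “Class”, “Object” and “Type” is rejected, unless these three concepts are passed in as already known.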

When the course material is organized in small, conceptual units, the focus
shifts from course-based knowledge transfer to concept-based knowledge
transfer (see Figure 1).

Typically, a student takes several mostly independent courses. Using a 
concept-based approach, this distinction between different courses vanishes, 
and the student is able to progress through the course units based on her 
experience and preferences. 

This concept-based organization of the knowledge pool is similar to the
proposal by Pilar da Silva et al. (1997). Our emphasis is on providing a
scalable and extensible architecture for the realization of such a system
based on open Internet standards. The navigation in the system is dynamically
computed for every student based on his knowledge and his needs. The
cognitive aspects of the actual presentation are of minor interest in our
paper, not least because the use of standard web techniques limits the
possible variations in the presentation and linking of the concepts.

Figure 1: Course-based and concept-based knowledge transfer



3 Navigation Structure 

The level of experience and the learning preferences of the students have
a direct influence on the navigation structure. Learning preferences are
expressed as unknown concepts rated with a level of interest, and experience
is expressed as known concepts rated with a level of confidence. Together,
they form the user profile of a student as shown in Figure 2.

In contrast to the course units and concepts which are mostly static, the 
navigation structure is dynamic. It changes whenever a student gains more 
knowledge about a concept or decides to change his learning preferences. 
Additional influences are the prerequisites which some concepts may have, 
resulting in additional concepts which have to be learned. Figure 3 shows 
the distinction between the static and dynamic parts of the system. 



Profile

- User Information (Name, ...)

- Known Concepts

    - Concept 1 x Confidence Level
    - Concept 2 x Confidence Level

- Unknown Concepts

    - Concept 3 x Interest Level
    - Concept 4 x Interest Level



Figure 2: Profile with known and unknown concepts 

The most important part of the navigation structure is the set of links
from a concept which has been covered completely to the possible following
concepts. This set contains all those concepts whose prerequisites are fully
satisfied (prerequisites being concepts already covered with a high
confidence value). They are ranked by level of interest.
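The selection and ranking of the possible following concepts can be sketched as follows. The sketch assumes that confidence levels lie between 0 and 1 and that "high confidence" means at least 0.8; both the scale and the threshold are assumptions, as are all names used.

```python
def next_concepts(known, unknown, prerequisites, min_confidence=0.8):
    """Rank the unknown concepts whose prerequisites are all satisfied.

    known:         dict concept -> confidence level (assumed range 0..1)
    unknown:       dict concept -> interest level (higher means more interest)
    prerequisites: dict concept -> concepts that must be known beforehand
    """
    mastered = {c for c, conf in known.items() if conf >= min_confidence}
    candidates = [
        c for c in unknown
        if set(prerequisites.get(c, ())) <= mastered
    ]
    # Most interesting concepts first.
    return sorted(candidates, key=lambda c: unknown[c], reverse=True)
```

Because the result depends only on the current profile, it can be recomputed whenever a confidence or interest level changes, which is what makes the navigation structure dynamic.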




Figure 3: Static and dynamic parts of the knowledge base (figure labels:
navigation structure; units and concepts)

A possible alternative to this approach would use a predefined navigation
structure as opposed to a dynamically generated one. The basic navigation
structure could be defined by the lecturer, and modifications would be
made by the students. The disadvantage, however, is that the students
need a certain level of knowledge to modify the predefined structure. A more
detailed discussion of different approaches to the definition or generation of
navigation structures is subject to further research.



4 Implementation and Infrastructure 

The model presented in this paper is targeted at a web-based environment
where the course material is available on a web server or distributed on
CD-ROMs. Students are able to read the course material on- or off-line using
web browsers. In our implementation, the Extensible Markup Language
(XML, see Bray et al. (1997)) is chosen to structure the course units. XML
offers more functionality than HTML (Raggett (1997)) for presentation
purposes, especially in conjunction with style sheets like CSS (Lie, Bos
(1997)). Furthermore, it provides an excellent means to define the structure
and semantics of a document through Document Type Definitions (DTDs).
Finally, it has a superior definition of link semantics (Maler, DeRose
(1998)). We can identify four types of XML documents:

• concepts,

• course units, 

• navigation structures and 

• user profiles. 

The structure of each document type is defined in a corresponding DTD; an
example for the user profile shown in Figure 2 is given in Figure 4.



<!ELEMENT PROFILE - - (STUDENT, CKNOWN, CUNKNOWN)>

<!ELEMENT NAME - - (#PCDATA)*>

<!ELEMENT STUDENT - - (NAME, ...)>



<!ELEMENT CONCEPT - - (NAME, ...)>

<!ELEMENT CONFIDENCE - - (#PCDATA)*>
<!ELEMENT INTEREST - - (#PCDATA)*>



<!ELEMENT CKNOWN - - (CONCEPT, CONFIDENCE)*>
<!ELEMENT CUNKNOWN - - (CONCEPT, INTEREST)*>



Figure 4: User profile DTD 



A user profile is created when the student enrolls in the first course.
This results in an initial listing of concepts in the profile. The concepts to
be learned in the course correspond to the CUNKNOWN elements; optionally,
already known concepts can be added as CKNOWN elements. As the student
progresses through the course, the system updates these elements and changes
concept references from CUNKNOWN to CKNOWN when the student passes the
corresponding control questions.
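The profile update described above amounts to moving a concept from the unknown part to the known part of the profile. A minimal sketch, using plain dictionaries instead of the XML documents of the actual system (the data layout and the function name are assumptions made for illustration):

```python
def pass_control_question(profile, concept, confidence):
    """Move a concept from the unknown to the known part of a profile.

    profile: {"known": {concept: confidence}, "unknown": {concept: interest}}
    """
    if concept not in profile["unknown"]:
        raise KeyError(f"{concept!r} is not listed as unknown")
    del profile["unknown"][concept]          # corresponds to CUNKNOWN
    profile["known"][concept] = confidence   # corresponds to CKNOWN
    return profile
```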

As mentioned above, the system will be used via a web browser. The resulting 
infrastructure is shown in Figure 5. In this scenario, the web server dynami- 
cally generates the user-specific navigation structures based on a unique user ID 
(implemented using the cookie mechanism, see Kristol, Montulli (1997)) and a 
corresponding user profile. The generation of the navigation structure is handled 
by CGI scripts. The resulting documents are XML documents, but it is possi- 
ble to convert them to HTML documents before transmission to the client. This 
allows the usage of browsers which are not XML/CSS-enabled. 




Figure 5: Infrastructure 

A more flexible approach, however, is the use of a web browser which is able
to execute scripts. For this purpose, we are using the Cineast browser
(Koppen et al. (1997)). It is written in OTcl (Wetherall, Lindblad (1995))
and handles HTML and XML documents along with CSS style definitions. Its main
advantage over commercial browsers is that it is a prototyping environment
for web-based applications which can execute embedded scripts. These scripts
have access to the document structure and to the network functionality of
the Cineast browser through a programming interface. On the one hand, this
interface covers most of the Document Object Model (DOM, see Byrne (1997))
and allows the generation and modification of XML documents. On the
other hand, the networking layer of the Cineast browser can be accessed
through OTcl objects, which makes it possible to fetch and store documents
across the Internet.

Figure 6 shows the architecture of the learning environment when the
Cineast browser is used. In this approach, the generation of the navigation
structure is realized through scripts embedded in the course material. The user profile
and the concept pool can be used off-line if the material is stored on the client 
side. 




Figure 6: Infrastructure using an enhanced web browser 

The static course material itself is created using XML/SGML enabled editors like 
Emacs, or generated from existing material, e.g. LaTeX or HTML sources. 



5 Conclusion 

It is obvious that static course material cannot meet the requirements which result 
from differing existing knowledge and learning preferences, increasing demand for 
more flexible course structures and changing course contents. We presented an 
approach that divides courses into the underlying concepts and connects these 
concepts using a dynamically created navigation structure. As a result, we gain 
the ability to create individual courses which reflect the experience and preferences 
of the students. They can be used for on- and off-line learning using web tech- 
niques such as XML, CSS and embedded scripts together with complying browser 
and server software. From the point of view of a lecturer, a pool of concepts is 
built rather than complete courses. This pool of concepts can be used as a basis 
for courses not only of a single department but of a whole faculty. 

However, the following aspects remain open for further work and discussion:

Realization of exercises and tests: Currently, there is no mechanism for an
automated review of exercises, since automated review works reasonably well
only for simple questions such as multiple choice questions. Additionally,
working on exercises in groups is difficult to implement. Here, a more
sophisticated technical foundation is needed. Another difficulty is the
organization of tests, where the test itself is ideally done under
supervision at the university.

Annotation of course material by the students: Students should be able 
to annotate the course material and share the annotation with other stu- 
dents. This aspect is covered in a research project currently in progress at 
our department. 

Creation of an infrastructure for a “learning community”: In conjunction
with the previous aspect, the establishment of a cooperating “learning
community” is a long-term goal. Students should be able not only to share
their thoughts regarding course materials through annotations, but also
to evaluate, discuss and criticize the work of the
department, providing valuable feedback for improvements.

Abstract: This paper presents results regarding the performance of
multi-dimensional scaling (MDS) when used to create three-dimensional
navigation maps. MDS aims at reducing high-dimensional space to
low-dimensional landscapes. Combined with browsers which are capable of
visualizing three-dimensional object information by applying the conceptual
basis of the Virtual Reality Modeling Language (VRML), MDS opens new
possibilities for cognitive receptive navigation. Web3D, a prototype
implementation using MDS, reveals the potential of this idea.



1 Introduction 

1.1 The Challenge: High-dimensional Cyberspace and 
Low-dimensional Maps as Navigational Support 

In the physical world two-dimensional maps and various other symbolic 
representations of our environment are the preferred means of orientation. 
Without such devices, it is almost impossible to navigate purposefully in 
new surroundings, e.g. in a new city, or country. In Cyberspace, there is 
also a need for orientation. However, the navigational support provided by 
links, or a collection of links, derived from Web pages, or search engines is 
rather limited. These approaches resemble street signs and, as such, are not 
that helpful for high-dimensional cyberspace navigation. Even if we follow 
only a few links, we lack a concise representation of this very small portion 
of the Web. It can be assumed that navigation in high-dimensional space 
is difficult, due to the fact that human cognition has been trained, as a 
result of evolution, for daily navigation within three-dimensional space (at 
the most four-dimensional navigation, if we take time into account). The 
idea is to create a low-dimensional landscape of cyberspace which is usually 
high-dimensional, or precise portions thereof. It can be assumed that such 
low-dimensional devices are better suited to help cyberspace visitors
navigate and eventually retrieve information (Girardin (1997), Honkela et al.
(1996)). Multivariate statistical methods aim at reducing high-dimensional
space to low-dimensional segments. This research aims to evaluate the
performance of multi-dimensional scaling (MDS). Combined with browsers
which are capable of visualizing three-dimensional object information by
applying the conceptual basis of the Virtual Reality Modeling Language
(VRML), MDS opens new possibilities for cognitive receptive navigation. Within
this paper, the term “document” is used as a technical term for the various 
forms of abstract entities which have been investigated. A document can 
be either a plain text-file, a Web-page or a compiled set of plain text files, 
which originate from Web pages or any document whatsoever. 



1.2 Related Research 

According to Benford (1995) creating a graphical representation of generic 
data can be achieved in four ways: 

1. Data or data attributes can be mapped onto spatial (i.e. x, y, z- 
coordinates) or visual dimensions (shape, size, color, spin which are 
logically independent of the spatial position). 

2. Data can be clustered using statistical methods that reveal (semanti- 
cal) similarity. 

3. Data can be organized using a hyperstructure approach, e.g. a tree 
drawing scheme. This works quite well with data that is already hier- 
archically structured such as file systems. 

4. Data can be represented with real world metaphors, like “shopping
mall”, “book shelves”, and “stock-room”. Problems arise if such
structures are not semantically adequate.

Excellent collections and discussions of 2D and 3D representations of large 
document collections include 

www.public.iastate.edu/~CYBERSTACKS/BigPic.htm
www.dur.ac.uk/~dcs3py/pages/work/Documents/lit-survey/IV-Survey/index.html
www.cybergeography.org/atlas/atlas.html

For applications based on Unified Matrix Methods see Ultsch (1993). None
of these approaches uses MDS. Further methods focusing on the construction
of lower-dimensional representations of Cyberspace are discussed in Girardin
(1995).

2 The Purpose of Multidimensional Scaling 

The overall goal of multi-dimensional scaling (MDS) is to analyze the
similarities or dissimilarities of objects (often expressed as distances
between objects) which have been observed. In general, MDS tries to
re-arrange a set of given objects within a space of a specified dimension, so
that the observed distances between the objects are reproduced as closely
as possible. A “good” result is obtained if an appropriate interpretation of
the dimensions, axes and distances is easily achieved. With respect to the
mathematical properties of the MDS procedure, the value of the
parameter “dimension” of the space within which the objects are
re-arranged lies in the interval [1, #objects]. However, one of the major
aims of MDS is to reduce dimensionality or complexity and, thus, to explain
the similarity matrix using as few of the underlying dimensions as possible.
For navigational purposes a three-dimensional space is proposed, because
cognition of a visualization in three dimensions is likely, whereas cognition
of a greater number of dimensions is relatively unlikely. For further
information on the mathematical treatment of MDS and its applications,
see Shepard et al. (1972).



3 Web3D: A Prototype Implementation Based on MDS

Web3D is a prototype implementation (Schoder et al. (1997)) developed at 
the University of Freiburg (IIG-Telematics, Prof. Dr. Günter Müller) based
on MDS. This section briefly sketches the concept and the envisioned 5-step 
procedure for navigating within a three-dimensional navigational map of a 
given set of documents. 

Step 1: Input: n documents as *.txt files. Process: Pairwise calculation of
the similarity of two documents at a time. Output: Similarity matrix of n
documents.

Step 1 runs n*(n-1) times, where n is the number of documents. Finally, a
similarity matrix for all n documents is derived.

Step 2: Input: Output of Step 1. Process: Calculate the three-dimensional
configuration using MDS. Output: List of objects (the underlying documents)
with their x, y, and z-coordinates.
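Step 2 can be illustrated with a small, self-contained sketch. It is not the MDS procedure used in Web3D, which the paper does not specify; it is a simple metric variant that minimizes the raw stress (the sum of squared differences between map distances and target distances) by gradient descent. It also assumes that the similarity matrix of Step 1 has already been converted into a dissimilarity matrix; the function names, step size and iteration count are illustrative choices.

```python
import math
import random


def stress(points, dissimilarity):
    """Raw stress: squared differences between map and target distances."""
    total = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            total += (math.dist(points[i], points[j]) - dissimilarity[i][j]) ** 2
    return total


def mds_3d(dissimilarity, iterations=500, rate=0.2, seed=0):
    """Return 3D coordinates whose distances approximate `dissimilarity`."""
    n = len(dissimilarity)
    rng = random.Random(seed)  # fixed seed for reproducible start positions
    points = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(n)]
    step = rate / max(n, 1)    # smaller steps for larger collections
    for _ in range(iterations):
        grads = [[0.0] * 3 for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                diff = [a - b for a, b in zip(points[i], points[j])]
                dist = math.hypot(*diff) or 1e-12  # avoid division by zero
                factor = 2.0 * (dist - dissimilarity[i][j]) / dist
                for k in range(3):
                    grads[i][k] += factor * diff[k]
                    grads[j][k] -= factor * diff[k]
        for i in range(n):
            for k in range(3):
                points[i][k] -= step * grads[i][k]
    return points
```

The returned list contains one [x, y, z] triple per document, in the order of the rows of the input matrix, which is the form of output that Step 3 consumes.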

Step 3: Input: Output of Step 2. Process: Convert the three-dimensional
coordinates to VRML syntax. Output: File in VRML syntax.

Within Step 3 the information regarding the three-dimensional coordinates,
along with additional information regarding the underlying documents (e.g.
length of document, age, type, original web address etc.), is merged and
converted to VRML syntax. VRML stands for Virtual Reality Modeling
Language and has recently become an ISO standard known as VRML97
(see www.vrml.org). It is a platform-independent standard for the exchange
of three-dimensional object and space information over networks and offers
features for the interaction of users within such spaces (Diehl (1997)).
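The conversion of Step 3 can be sketched as follows. The sketch writes one VRML97 sphere per document and wraps it in an Anchor node so that selecting the object leads to the original web address; whether Web3D uses Anchor nodes, and the chosen radius, are assumptions, and the function name is hypothetical.

```python
def to_vrml(documents, radius=0.2):
    """Render documents as VRML97 spheres.

    documents: iterable of (url, (x, y, z)) pairs, e.g. the output of Step 2
               combined with the original web addresses
    """
    lines = ["#VRML V2.0 utf8"]  # mandatory VRML97 file header
    for url, (x, y, z) in documents:
        lines.append(
            f'Anchor {{ url "{url}" children [ '
            f"Transform {{ translation {x:.3f} {y:.3f} {z:.3f} children [ "
            f"Shape {{ appearance Appearance {{ material Material {{ }} }} "
            f"geometry Sphere {{ radius {radius} }} }} ] }} ] }}"
        )
    return "\n".join(lines) + "\n"
```

Other geometric primitives of VRML97, such as Box or Cone, could be substituted for Sphere to distinguish document types.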

Step 4: Input: Output of Step 3. Process: Visualization of the VRML file.
Output: Three-dimensional space with visualized objects within a VRML
browser.

For the visualization of the VRML files, standard plug-ins for common
Web browsers are used, in particular the Cosmo Player 2.0 by SGI for the
Netscape browser (see www.sgi.com). The visualization process takes only
seconds.

Step 5: Based on Step 4 the user can now navigate within the visualized
document space. Basic navigation functions include zoom and various
ways to move into or around the collection in any direction. Additional
features (which have not yet been implemented) include pre-defined views of
the collection, collision detection while browsing within the
three-dimensional space, video and audio animation, or billboards, i.e.
information presented on boards which are continuously directed towards the
view angle of the user navigating the space.



4 Initial Experiences with Web3D

Two 3D maps are presented as examples in order to show the limitations, as
well as the potential benefits, of a more sophisticated future version (see in
particular the discussion in Section 5).




Figure 1: 18 documents of distinctly different types of sources 



Figure 1 depicts a typical view of a three-dimensional document space (here,
due to the restrictions of the medium, reduced to two dimensions). On
the left-hand side, in an extra browser window, the file names of the
documents are listed. For presentation purposes, the different types of
documents correspond to different geometric primitives. The cones represent
four documents related to descriptions of the movie “Conair”, whereas the
cubes in the lower right area of the picture represent product descriptions
of Mercedes cars. In the upper right-hand corner, there are product
descriptions of Porsche cars. Towards the middle, the documents underlying
the objects are descriptions of various BMW automobiles. As might be
expected, and as can be recognized in Figure 1, a certain clustering of the
various types of documents can be seen. The right-hand side of Figure 1
shows the browser window. In the lower part, the user has access to various
navigation functions.

Figure 2 illustrates a more practical application scenario. The aim is to
visualize the “landscape” of a selected list of marketing departments in
Germany. The document collection was taken from a recently published book
on “Computer Based Marketing” (Hippner et al. (1998)). One chapter of
this book presents marketing departments or departments which relate to
market research. In total, 66 presentations were processed electronically in
the manner described above, resulting in Figure 2. The figure shows a
relatively regularly distributed cloud of spheres which forms a larger
sphere. There is almost no recognizable clustering. Additional analysis
revealed that the self-descriptions of the marketing departments are
syntactically extremely different. The applied algorithm did not recognize
similar documents. Thus, there is little variation in the similarity values
among the documents. See Section 5.1 for potential remedies.




5 Discussion 

5.1 Similarity of Documents 

A good measure of similarity of documents is essential. For the prototype,
a relatively simple word-by-word comparison is used. Two documents at
a time are compared. If a word exists in both documents, the word count
is increased. It is assumed that the higher the word count, the higher the
degree of similarity. To improve the results, a stop list is applied and
the “Porter Rules” (Porter 1980) for powerful stemming are used for English
texts. In addition, discriminatory terms are weighted.
For this, the well-known term frequency times inverse document frequency
(tf-idf) scheme is applied (Salton 1989). Example 2 illustrates a major
challenge for navigation and information retrieval: without a “good”
calculation of the similarity of objects, no procedure using the far from
perfect data will be able to visualize existing structures. A future
implementation of Web3D will incorporate more sophisticated methods including
AI methods and more linguistic concepts. The author is not aware of any
publicly available implementation equivalent to the Porter Rules
for powerful stemming of German texts. Such stemming would radically improve
the simplistic approach of using word-by-word comparison.
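The word-by-word comparison can be sketched as follows. The sketch counts the distinct words shared by two documents after removing stop words; stemming and tf-idf weighting are omitted, and the stop list, the tokenization rule and all names are illustrative assumptions rather than the prototype's actual choices.

```python
import re

# Illustrative stop list only; a real stop list would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "to", "is"}


def words(text):
    """Lower-cased set of words in a text, without stop words."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOP_WORDS


def similarity(doc_a, doc_b):
    """Number of distinct words that occur in both documents."""
    return len(words(doc_a) & words(doc_b))


def similarity_matrix(documents):
    """Symmetric matrix of pairwise similarities; each pair is compared once.

    The diagonal is left at 0 because a document is not compared with itself.
    """
    n = len(documents)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = similarity(documents[i], documents[j])
    return matrix
```

Documents that use entirely different vocabulary obtain a similarity of 0 with every other document, which is the situation observed for the self-descriptions of the marketing departments.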



5.2 Large Document Collections 

Mapping equally distant objects to a 3D configuration while preserving
the distance measure between all these objects results in a sphere. For
large document collections (>100) this might lead, in MDS-derived
configurations, to objects that are more or less regularly distributed
on the surface of the resulting sphere, hindering cognitive receptive
navigation. Further research has to show the benefits of the proposed MDS
approach for large document collections.



5.3 Visualization 

The size of the document collection and the distribution of the similarity 
values strongly determine the overall shape of the derived three-dimensional 
space. From time to time this may create problems for an appropriate nor- 
malization of the size of the visualized objects in relation to the “whole” 
configuration space. Currently, the parameters for normalization are man- 
ually adjusted. Ill-defined normalization parameters will either lead to a 
chunk of objects which portrays a greater relation between the objects than 
is actually present, or a relatively broad distribution of objects within the 
configuration space, with the result that close relationships between docu- 
ments are hidden. 



5.4 Browsing Through 3D-maps 

More information on the underlying documents may be incorporated into the
space. VRML offers several suitable features, for example “billboards”,
which allow information relating to an object to be presented once the user
browses close to it.

5.5 Performance

The comparison of documents in pairs in order to derive the similarity
matrix requires n(n-1)/2 comparisons. For several hundred documents with
several thousand words each, this is a huge calculation task that exceeds
today’s PC performance. Thus, Web3D applications are limited to small
document collections, or at least to non-real-time applications. One
approach to avoid the explosion of necessary CPU time is to use reference
documents: if a new document of a given collection is not similar to any of
a pre-calculated set of reference documents, this document might be
excluded from any further step in the calculation.
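
The effect of the pairwise comparison and of the reference-document filter
can be illustrated with a minimal sketch. It is not the Web3D
implementation: the word-overlap similarity, the function names and the
threshold of 0.3 are assumptions chosen only for illustration.

```python
from itertools import combinations


def word_overlap(a: str, b: str) -> float:
    """Illustrative similarity: shared words divided by all distinct words."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def pair_count(n: int) -> int:
    """Number of unordered document pairs, n(n-1)/2."""
    return n * (n - 1) // 2


def filter_by_references(docs, references, threshold):
    """Keep only documents similar to at least one reference document."""
    return [d for d in docs
            if any(word_overlap(d, r) >= threshold for r in references)]


def similarity_matrix(docs):
    """Symmetric similarity matrix built from n(n-1)/2 comparisons."""
    n = len(docs)
    sim = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in combinations(range(n), 2):
        sim[i][j] = sim[j][i] = word_overlap(docs[i], docs[j])
    return sim


# Hypothetical mini collection and one hypothetical reference document.
docs = [
    "classification of web documents",
    "web documents and navigation",
    "recipes for apple cake",
]
references = ["navigation in web documents"]

kept = filter_by_references(docs, references, threshold=0.3)
sim = similarity_matrix(kept)
print(pair_count(300))                 # comparisons needed for 300 documents
print(len(kept), pair_count(len(kept)))
```

Filtering costs one comparison per document and reference document, so it
only pays off when the reference set is small compared with the collection.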

5.6 Usability 

Comments made at presentations of Web3D highlighted the need for
individualization. For example, a user-defined reference document might be
visualized in a particularly noticeable color or with a characteristic
geometric shape. Thus, an immediate impression of the relation (distance)
of all other documents to this reference document would be possible.

5.7 Possible Applications 

Applications include the business case of seeking and matching business
partners. If the personal home pages of a certain group of persons serve as
sample Web pages (or documents in general), a specific domain can be
transformed into a three-dimensional map by using the multivariate
statistical method referred to above. The more similar the Web pages are,
the more closely they are arranged to each other. Assuming that real
persons are behind the Web pages, it might be beneficial for those persons
whose Web pages are quite similar to be able to contact each other, since
it is likely that they share the same interests. Furthermore, value-added
services might be created. For example, software agent technology can be
used to monitor the map. According to changes on the map, the relevant
persons can be informed about a newcomer, or significant shifts of
individuals with respect to their home pages could be announced. In this
respect, automatic monitoring of specific domains might be feasible.



6 Conclusion 

In principle, multi-dimensional scaling (MDS) is a powerful procedure which 
can be used to reduce complexity using a given set of similarity measures of 
objects. Web3D, a prototype implementation based on MDS, was applied to 
several collections of documents. The two examples briefly introduced
provide food for thought regarding the potential applications and
limitations of MDS-based three-dimensional maps. The quality of the
information provided by three-dimensional maps is heavily dependent on the
pre-calculation
of the similarity of documents. MDS is scarcely capable of correcting a bad 
similarity measure, or an ill-defined similarity matrix. On the other hand, 
initial tests, based on “good” similarity measures, are promising. MDS can 
provide a powerful tool for deriving three-dimensional maps for cognitively 
receptive navigational support and, as a result, it may eventually lead to an 
independent approach to navigation in high-dimensional cyberspace.

Abstract: Global markets and worldwide collaborative networks demand a new
approach to knowledge acquisition, distribution and use. Teaching and learn- 
ing are no longer restricted to school and university alone but become a lifelong 
challenge. Organizational knowledge management evolves to a strong success fac- 
tor in competitive markets. As large parts of organizational knowledge exist in 
the form of documents, not structured data in database systems, the methodical 
analysis and structuring of documents becomes an important issue for corporate 
information management. The potential of semantically structured documents 
as a platform for cross media organization of corporate knowledge components is 
demonstrated by the research project ELBE. 



1 Knowledge Management and Knowledge 
Acquisition 

As an effect of the globalization of business processes and of product and
market information, product life cycles and the time to market have become
shorter in recent years, while product and service quality has had to be
improved. It is getting more and more difficult for vendors to distinguish
their products and to convey their USP (unique selling proposition) to
their customers. The fast growing amount of globally available, immediately
accessible, but only in some cases really needed information calls for
systematic information filtering, evaluation and application, thus
providing reusable knowledge. Its management is becoming a key success
factor in future market competition and a very important platform upon
which temporary cooperation and virtual organizations can be built
(Schneider (1996)). When we talk about knowledge we should make the
following distinctions:

1. Data are stored facts and rules, which are not linked to individual
tasks or decision problems and thus do not (yet) carry semantics.

2. Information consists of “rich data” applied to a specific context, thus
possessing certain semantics.

3. Knowledge is specific information linked to processes, which is
understood by the user and is in active use and change.

Knowledge derives from the repeated use of information in an appropriate
situation. It represents experience gained over time or is the result of logical 
analysis of the effects of the use of specific information. Knowledge can be 
classified into individual or collective, explicit (“disembodied”) or implicit 
(tacit or “embodied”) parts. Knowledge as a whole has to be isolated into 
specific objects to become accessible by information technology. We refer to 
these objects as knowledge components. A company’s organizational know- 
ledge base consists of the following levels: 

1. Explicit and documented knowledge, 

2. Explicit but currently not collectively accessible knowledge (due to in- 
formation and communication pathologies; for example the content of 
a certain process description in the paper based environmental manual 
of a company which is remembered but cannot be retrieved ad hoc), 

3. Tacit but in principle explicable knowledge and 

4. Individual and collective tacit knowledge (Rehäuser and Krcmar (1996),
Schneider (1996)). 

The first three levels contain addressable knowledge objects. These com- 
ponents can be modeled by information management methods and tools 
(discussed in the following section). When thus explicated, they form an 
accessible organizational knowledge base for information retrieval processes 
triggered by actual business decision situations. The knowledge components 
on the fourth level are linked to personal experiences and personal or collec- 
tive processes. They can only be isolated and then in some cases documented 
and added to the explicit organizational knowledge base by procedural mea- 
sures, especially organizational learning concepts (Schneider (1996)). Level 
4 knowledge and its social, psychological and educational problems are not 
considered further in this paper. 

The demand for lifelong learning forces both schools and companies to
rethink traditional methods of teaching and learning. Information
technology has a strong influence on the management of organizational
learning processes. The use of modern information infrastructure,
especially CSCW tools (computer supported cooperative work) and CBT systems
(computer based training), leads to a virtualization of education and
supplements traditional instruction processes. Teaching and learning no
longer need to take place under synchronous conditions of space and time.
In organizations, individual learning becomes a more and more integrated
part of collective learning processes (Probst and Büchel (1994)).

Figure 1 shows three different views upon knowledge organization. Corporate
management describes strategic goals, which demand specific organizational
knowledge. It has to identify the relevant knowledge components and to
align problem and knowledge structures. The organizational learning
perspective concentrates upon individual and collective knowledge
acquisition by providing the appropriate motivational, communicative and
cooperative background needed for successful collaboration in working
teams. Information management has to cope with the technical aspects of
modeling data, document and process structures by the use of appropriate
methods. The task is to provide an open data, document and process
architecture as a common platform for application systems managing
corporate knowledge. Already explicitly known information has to be
structured, information barriers to explicit but currently inaccessible
knowledge have to be reduced, and parts of tacit knowledge have to be
explicated by means of information analysis and business process
engineering and modeling to provide a transparent and collectively
accessible organizational knowledge base.

Figure 1: Different views upon organizational knowledge



2 Information Management and Knowledge 
Organization 

Efficient information management coordinates and combines business and
information strategies. It derives support measures for the information and
communication processes in organizations by providing the appropriate
information infrastructure (i.e. application systems, rules, methods, tools
and skilled personnel). Within companies information management strives for
process rationalization; externally it supports marketing in achieving
strategic competitive advantages. Figure 2 describes the traditional role
of information management within organizations in interaction with
corporate management and its goals. Information systems, as the object of
information management, consist of a number of tasks, which are solved
either by computers or by humans or by both in combination. Application
systems cover the tasks that can be automated by computers. Normally, there
remain a number of tasks outside application systems that still need human
interaction. Therefore, information management has to provide both
organizational and technical solutions to support information systems, i.e.
business process reengineering combined with the implementation of specific
application and communication systems for task automation and work group
support (Schoop (1998)).




Figure 2: The role of information management in organizations 

By which organizational means and by which methods and tools can
information management be improved to support the growing importance of
knowledge management in corporations? Apart from the isolated use of
expensive expert systems based upon domain specific knowledge engineering,
the last decade’s discussions in information management mainly focussed
upon process rationalization and application integration. The development
of new methods concentrated upon data and process engineering and upon
customizable standard applications with access to integrated corporate
database systems. We name these applications “business data management” and
“engineering data management” systems to distinguish them from “document
management” systems. The importance of documents, business and technical
documentation for information management was long neglected. Current
estimates that more than 70 percent of the information used in business
processes is contained in documents, not in databases, lead to a strong
demand for procedural integration of document and data management efforts
(Reinhardt (1994), Mertens and Morschheuser (1994)).

But this is not enough to release information objects from their enclosing
documents. To improve automatic accessibility of organizational knowledge,
we suggest that procedural document management systems (supporting imaging,
archiving, retrieval and workflow processes) should be supplemented by
building up a semantically structured organizational document base. The
explication and linking of important within-document components by use of
specific meta models allows the generation of hypertext systems that
identify and address knowledge objects with improved accuracy. The
hypertext metaphor adds an intuitive multimedia user interface, high
navigational potential and organizational flexibility to handle multiple
interconnected information objects.

To automate information processes upon a structured, modular document base,
hypertext/hypermedia systems can be generated by applying the international
document standard SGML and its derivative XML, which is currently under
discussion, to the already existing documentation (for details about SGML
see the literature discussed in Schoop and Schraml (1996)). The technical
idea of implementing advanced hypertext systems upon SGML structures was
already described in 1990, when Halasz and Schwartz first discussed the
Dexter reference model of open hypertext systems. They suggested a
three-layer architecture with an abstract hypertext machine managing
objects, links and attributes on the storage layer, enclosed by a user
oriented run time layer and a within-component layer. The latter should
contain document components with structures to be explicated in SGML
(Grønbæk and Trigg (1994), Halasz and Schwartz (1994)).

The currently favored hypertext markup language HTML represents rather 
the layout, but not the content structure of documents and should therefore 
not be used for automatic content interpretation. In contrast to HTML, 
SGML and also its compatible “lean version” XML both provide us with
technical means to semantically represent document components and to 
make them thus accessible by hypertext application systems. But to be 
efficient, both standards first demand intellectual effort concerning the struc- 
ture explication. This process has to answer questions like how to identify 
the information really needed by the user, how to define and to describe the 
appropriate document and component structure and how to decide about 
the necessary granularity of the nodes and the density of the knowledge 
network. 
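
The difference between layout markup and semantic markup can be made
concrete with a small sketch. The XML fragment, its element names
(manual, procedure, step, definition) and its attributes are hypothetical
and are not taken from the ELBE project; the sketch only shows that
content-typed components become directly addressable by an application.

```python
import xml.etree.ElementTree as ET

# Hypothetical semantic markup; all element and attribute names are
# illustrative assumptions, not part of any published DTD.
SEMANTIC_XML = """
<manual product="pump-x">
  <procedure id="p1">
    <title>Replacing the filter</title>
    <step>Switch off the pump.</step>
    <step>Remove the filter housing.</step>
  </procedure>
  <definition term="filter housing">Casing that holds the filter.</definition>
</manual>
"""

root = ET.fromstring(SEMANTIC_XML.strip())

# Because the tags name the type of content, an application can address
# procedures and definitions without interpreting the layout.
procedures = {p.get("id"): [s.text for s in p.findall("step")]
              for p in root.iter("procedure")}
definitions = {d.get("term"): d.text for d in root.iter("definition")}

print(procedures)
print(definitions)
```

The granularity question raised above shows up directly in such a sketch:
whether a single step or the whole procedure is the addressable node is a
modelling decision, not a technical one.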

We suggest a stepwise procedure to solve these problems. First, information
systems, their data and especially their documents must be analyzed from
the viewpoint of the application and of the information demand, and the
information needed must be identified and isolated. Then, semantic
structure models have to be specified and applied to the appropriate
document classes to produce the semantic framework for the documents, which
are tagged correspondingly in the last step. We call this methodical
approach “document engineering” in analogy to the well-known term “data
engineering”, which describes the systematic development of entity
relationship models for organizational data architectures (Schoop (1998)).

The main focus of document engineering lies upon the early analytical
phase. It demands both fundamental knowledge of the business application
domain and strong experience in document structuring. Here we apply the
Information Mapping method developed by Horn (1989). As a result of this
methodical document structuring process we get clearly isolated and
semantically typed information blocks and maps as small knowledge
components. In the second step these modules must be hierarchically
structured and linked into larger semantic models, combining both the
bottom up approach (document view) and the top down approach (user’s view).
For structure representation we refer to the SGML standard and describe
these models in the form of DTDs (document type definitions). They describe
the application view, build a corresponding meta structure of the relevant
document classes (e.g. information about the content components within
product manuals, technical or organizational handbooks, representing
product knowledge and procedural knowledge) and can serve as navigational
elements in a hypermedia knowledge management system.

Figure 3 shows our proposal for a stepwise approach to SGML or XML based 
document and knowledge management. 



[Figure 3 content: document engineering in three steps]

analysis: information demanded by users, functionality offered by
providers, environmental restrictions; identification and classification
of potential semantic components, selection of relevant semantic components

DTD engineering: modelling the hierarchical structure of the document
class, modelling of elements, attributes and entities; designing and
linking of modules

production & distribution: reengineering of existing documents, production
of new documents, organization of the document base, application
development



Figure 3: Methodical steps of document engineering 



3 Project Example: Structured Hypermedia 
Textbooks 

We instruct our students in the theory and practice of information
management, especially document, process and knowledge management. Besides
receiving instruction, the students have to practise their methodical
knowledge and their skills in applying appropriate tools with regard to the
demands of their later information management jobs. To gain and to deliver
practical experience in document engineering methods and in the development
and use of hypertexts exploiting structured documentation, and to
demonstrate the method’s potential for building up organizational knowledge
bases and for organizational learning, we started our research project ELBE
in 1997. The acronym stands for the development of electronic lectures for
business education.

This hypertext project is still at an early stage. It pursues two
objectives. The first is to demonstrate the potential of SGML based
document engineering for cross media publishing of different learning
materials from a single source. The second is to implement an open
collaborative hypertext system upon the structured document base to support
information retrieval, self guided learning and a CSCL (computer supported
cooperative learning) environment.

Figure 4 demonstrates how we build up, organize and manage our knowledge
base about the information management domain for educational purposes. As
a first project benefit we expect a well-structured paper based
documentation of the lectures’ contents, providing easier access for
learning and for updating purposes. Second, the electronic hypertext
version is interconnected with further publications such as diploma theses
and working papers, computer based training materials, frequently asked
questions and answers, and links to supplementary web sites. Orientation
and navigation aids (thesaurus, descriptors, indexes, glossaries and
additional explicit links) are intended to add further value to the
electronic version. Thus ELBE is both a research object and a collaborative
training object for document engineering and knowledge management methods.

Figure 4: Stepwise provision of knowledge components in the ELBE project



4 Conclusions 

Knowledge management is a combined economic, organizational and
psychological issue. Information management provides methods and tools
appropriate to support the organization of the corporate knowledge base.
Paying more attention to the knowledge components already existing in
business and technical documentation, and representing them in a structured
way by the use of document standards like SGML, can help to make larger
parts of the explicable organizational knowledge base accessible for
automated processing through hypertext application systems. The first
prototype of our research project ELBE already demonstrates this potential.
Future versions are expected to open up perspectives for methodical and
technical improvements.

Abstract: Information on how probable it is that a given company will
become insolvent is important for owners, creditors and other financiers of
this company. Investors in particular need this information to calculate
and control the risk they take with an investment decision. We show in this
paper how the probability of corporate failure can be measured with
artificial neural networks (ANN), namely mixture-of-experts networks. With
the help of 8,660 financial statements of 3,125 industrial companies we
developed a mixture-of-experts network that is able to classify 90% of all
companies which became insolvent within the next three years correctly; the
corresponding misclassification rate of actually solvent firms is only 29%
(Jerschensky (1998)).



1 Insolvency Risk and Probability of Insolvency

If risk is defined as the danger of suffering losses induced by a certain de- 
cision, an investor faces a variety of risks. One of these risks is the risk, 
that money will be lost because the company, in which the investor is en- 
gaged, goes bankrupt. This risk is called insolvency risk. The measuring 
and controlling of insolvency risk has gained more and more importance in 
recent years. The number of corporate failures in Germany has risen every 
year since 1991. Starting with 8,837 insolvencies of companies in 1991 this 
number reached a new record level in 1997 when 27,485 companies failed 
(Statistisches Bundesamt (1998)). Hence investors as well as creditors need 
an instrument that helps them to predict failures of the companies they own 
or to whom they lent money. 

First it has to be determined what this instrument should measure exactly. 
This can be derived from the concept of risk the investor has. According 
to the general definition of risk given by Kupsch (1973) insolvency risk can 
be defined as a combination of a possible loss in case of insolvency and the 
danger of insolvency. A simple measure for this risk is 

\[ \text{insolvency risk} = \text{amount of invested money} \times \text{probability of insolvency} \qquad (1) \]

So an instrument is needed to measure the danger of insolvency as the
probability of insolvency of a given company or a set of given companies.
This holds true not only if the expected loss is used as a measure for
insolvency risk as we did in (1) but also if any other risk measure (for a
synopsis see Brachinger, Weber (1997)) is preferred.
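
As a worked illustration of measure (1), the following sketch computes the
expected loss for a small portfolio. The invested amounts and probabilities
are hypothetical, and the function name is chosen only for this example.

```python
def insolvency_risk(invested: float, p_insolvency: float) -> float:
    """Expected loss as in (1): invested amount times probability."""
    if not 0.0 <= p_insolvency <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    return invested * p_insolvency


# Hypothetical portfolio: (invested amount, probability of insolvency).
portfolio = [(100_000.0, 0.02), (50_000.0, 0.10)]

# Expected losses add up across engagements (linearity of expectation).
total_risk = sum(insolvency_risk(a, p) for a, p in portfolio)
print(total_risk)
```

The sketch treats the whole invested amount as lost in case of insolvency,
which is the simplification made by measure (1) itself.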

Furthermore the concept of probability represents correctly the empirical 
phenomenon “danger of insolvency” for two reasons (Jerschensky (1998)): 

1. Every company is immortal in principle. It depends solely on the 
future decisions and the financial capability of investors and creditors 
whether a company will become insolvent or not. This means that a 
company is at a certain point in time not insolvent to a certain degree, 
as for instance a fuzzy concept would assume, but it is insolvent or 
not. Only the chance that a given company might become insolvent 
in the future is bigger or smaller. 

2. The main reason for every insolvency is disastrous management
decisions (Krystek (1987)). Due to the stochastic nature of the future
it cannot be foreseen - and in most cases is not even determined -
whether a management decision is a fatal error or not.

2 The Problem of Measuring the Danger of 
Insolvency 

The property “danger of insolvency” of a company can neither be observed 
directly nor represented (i. e. measured) directly as probability of insolvency. 
This property must be reduced to observable features of a company in a first 
step. This is done by generating working hypotheses on how the property 
“danger of insolvency” is related to observable features such as components 
of financial statements. In a second step these features are measured for 
existing companies. The third step consists of measuring the probability of 
insolvency on the basis of the measured features. Hence the probability of 
insolvency we are looking for is an a posteriori probability in the Bayesian
sense (Duda, Hart (1973)). The process of measuring the probability of 
insolvency is demonstrated in figure 1. 

We started from the assumption that “The more risks a company can take and
the fewer risks a company has taken, the better it can avoid insolvency”.
In a first step we broke this basic hypothesis down into subhypotheses such
as “The lower the ratio of liabilities to capital is, the more stable is a
company”. For the sake of objectivity, comparability and availability we
chose only hypotheses based on financial statements and the national
economic situation. Thus each feature vector consisted of 53 entries: 48
basic financial ratios, four economic indicators and the class of turnover
(“small”, “middle-sized” and “big” as defined in § 267 HGB (German
Commercial Code)). In the second step we gathered the data according to the
hypotheses generated in step one for 334 later insolvent and 2,791 solvent
industrial companies. With the help of ANNs we accomplished step three and
built an ANN that is able to measure the a posteriori probability of
insolvency of industrial companies quite accurately.








Figure 1: Measurement of the a posteriori probability of insolvency. 



3 The Mixture-of-Experts Topology 

The outputs of an ANN can be interpreted as estimators of Bayesian a pos- 
teriori probabilities if the desired outputs are 1-of-n coded (one output unity, 
all others zero) and if a squared-error or cross-entropy cost function is used 
for training (Richard, Lippmann (1991); Hampshire, Pearlmutter (1991)).
The mixture-of-experts topology is a kind of ANN for which these require- 
ments are fulfilled. 

This topology was first presented in Jacobs et al. (1991) and is based on
the idea that a complex task can be solved more easily if it is divided
into simpler subtasks. A mixture-of-experts network consists of a group of
networks (the so-called “experts”) competing to learn different aspects of
a problem. A gating network controls the competition among the expert
networks and learns which expert network should dominate the output of the
whole mixture-of-experts network when a particular input is presented. Thus
a mixture-of-experts network learns that different input patterns belong to
different subtasks (Jacobs, Jordan (1991)). The general architecture of a
mixture-of-experts network is shown in figure 2.







Figure 2: Architecture of a mixture-of-experts network. 



Both the expert networks and the gating network are fully connected to 
the input layer. The expert networks and the gating network are trained 
separately but simultaneously. Each expert network is trained as if it were 
the only network, i. e. each network is to provide the expected output of 
the whole mixture-of-experts network. The goal of the training of the gating
network is to have the gating network learn an appropriate decomposition 
of the input space into different regions by assigning the responsibility of 
generating the output for different input regions to different expert networks. 
This goal is achieved by using a special error function. This error function 
is 

E := [formula lost in extraction]   (2)

where q_j is the weight the gating network assigns to the j-th expert
network, y* is the desired output, y_j is the output of the j-th expert
network and n is the number of expert networks (Jacobs et al. (1991)). The
gating network and the expert networks are trained by back-propagating the
error. The weights are updated using an update rule such as the delta rule
(Rumelhart et al. (1986)).

The architectures of each of the expert networks do not have to be identical. 
If there is a priori knowledge of the structure of the task the network should 
solve, e. g. that it consists of a linear and a non linear subtask, then the 
architectures of the expert networks should be chosen accordingly. In most 
cases, however, such knowledge is not at hand and the same architecture
is chosen for all expert networks. 

The architecture of the gating network is more restricted. The number of 
output neurons must match the number of expert networks and the sum of 
the activation of all output neurons must equal one. Often a softmax acti- 
vation function (Bridle (1989)) is used to meet the last-named requirement. 
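
The interplay of gating network and expert networks can be sketched for a
single input pattern as follows. This is a minimal illustration, not the
network described in this paper: the activations and expert outputs are
hypothetical numbers, and the error function is assumed to be the negative
log-likelihood form proposed in Jacobs et al. (1991), which may differ in
detail from equation (2).

```python
import math


def softmax(z):
    """Softmax activation: positive weights that sum to one."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def mixture_output(gate_activations, expert_outputs):
    """Blend scalar expert outputs y_j with the gating weights q_j."""
    q = softmax(gate_activations)
    y = sum(qj * yj for qj, yj in zip(q, expert_outputs))
    return q, y


def moe_error(q, expert_outputs, target):
    """Assumed error: E = -ln sum_j q_j exp(-0.5 (y* - y_j)^2)."""
    return -math.log(sum(qj * math.exp(-0.5 * (target - yj) ** 2)
                         for qj, yj in zip(q, expert_outputs)))


# Hypothetical values for three experts and one later insolvent company.
gate_activations = [2.0, 0.5, -1.0]
expert_outputs = [0.9, 0.4, 0.1]

q, y = mixture_output(gate_activations, expert_outputs)
print(q, y, moe_error(q, expert_outputs, target=1.0))
```

Under this assumed error function an expert is rewarded individually for
being close to the desired output, which is what drives the specialisation
of the experts on different regions of the input space.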



4 Measuring the Probability of Insolvency 
with a Mixture-of-Experts Net

4.1 The Structure of the Empirical Data 

We wanted to train a mixture-of-experts network to estimate the a
posteriori probability for the event “insolvency within the next three
years”. Therefore we needed the financial statements of later insolvent and
solvent companies to train a network, to test the parameters regarding the
architecture and the learning of the network respectively, and finally to
validate it. We split the whole set of available financial statements,
consisting of 684 financial statements of later insolvent companies and
7,976 financial statements of solvent companies, into three sets as
follows:





             Training set    Test set    Validation set
Insolvent         342           171            171
Solvent           342         3,817          3,817



Table 1: Structure of the empirical data. 
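
The figures of Table 1 can be checked against the totals given in the text
with a few lines of arithmetic. The sketch below only restates the counts
of the table; the variable names are chosen for this example.

```python
# Counts of financial statements taken from Table 1.
split = {
    "training":   {"insolvent": 342, "solvent": 342},
    "test":       {"insolvent": 171, "solvent": 3817},
    "validation": {"insolvent": 171, "solvent": 3817},
}

# Totals per class over all three sets.
totals = {cls: sum(part[cls] for part in split.values())
          for cls in ("insolvent", "solvent")}

# Share of later insolvent statements within each set.
insolvent_share = {
    name: part["insolvent"] / (part["insolvent"] + part["solvent"])
    for name, part in split.items()
}

print(totals)
print(insolvent_share)
```

The training set is balanced, whereas in the test and validation sets
later insolvent statements account for only about 4% of the cases.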



4.2 The Structure and Performance of the Best Network (MoE-31)

We started with a series of mixture-of-experts networks which contained up
to ten experts (each a two-layered multi-layer perceptron) and various
numbers of hidden neurons. After training and pruning, the best network
turned out to be a mixture-of-experts network consisting of three expert
networks. Each of the expert networks has two neurons in the hidden layer
whereas the gating network has three neurons in the hidden layer. The input
vector of this mixture-of-experts network was composed of 31 features (26
financial ratios, the class of turnover and the four economic indicators).
Thus we chose the name MoE-31 for this network.

There are two kinds of possible misclassification in a two-class
classification problem. It is possible that a later insolvent company is
classified as solvent (called α-error) or that an actually solvent company
is incorrectly classified as insolvent (called β-error). These two kinds of
errors must be treated separately since they have different economic
consequences. If the companies of the validation set are classified by the
decision rule “If the a posteriori probability of insolvency is greater
than x% then classify the company as insolvent” and if the α-β-error
combinations for all x ∈ [0, 100] are calculated, then we get a
characteristic α-β-error curve for a given classifier. Table 2 contains
selected points of this error curve for MoE-31 measured on the validation
set.



α-error    5%       8.7%     1?%    20%      30%      40%     50%     60%
β-error    38.3%    31.7%    ?      17.3%    11.6%    8.1%    5.1%    3.5%



Table 2: Performance of MoE-31. 

According to the characteristic α-β-error curve, MoE-31 performs
quite well. Even if only 8.7% of all financial statements of actually insolvent
companies are misclassified, the fraction of financial statements of actually
solvent companies that is classified as “later insolvent” by mistake is only
31.7%. When we started our research in the late 1980s, the β-error
corresponding to an α-error of 8.7% was about 49%, achieved with a linear
discriminant function on a different set of data (Baetge (1989)). Another and
more robust measure of performance is the area below the α-β-error curve
(Eberhart et al. (1990)). For MoE-31 this area amounts to 10.76%, measured
on the validation set.
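
The construction of the error curve and of the area below it can be sketched
as follows. This is only an illustration under stated assumptions: the function
names, the toy probabilities and the trapezoidal rule for the area are ours and
are not taken from the study; only the decision rule ("insolvent if the a
posteriori probability exceeds x%") follows the text.

```python
def error_curve(p_insolvent, is_insolvent, thresholds):
    """Return one (alpha, beta) pair per threshold x, with x given in percent."""
    n_insolvent = sum(is_insolvent)
    n_solvent = len(is_insolvent) - n_insolvent
    points = []
    for x in thresholds:
        # Decision rule from the text: insolvent if P(insolvency) > x%.
        predicted = [p > x / 100.0 for p in p_insolvent]
        pairs = list(zip(predicted, is_insolvent))
        # alpha-error: later insolvent companies classified as solvent.
        alpha = sum(1 for pred, y in pairs if y and not pred) / n_insolvent
        # beta-error: solvent companies classified as insolvent.
        beta = sum(1 for pred, y in pairs if not y and pred) / n_solvent
        points.append((alpha, beta))
    return points


def area_below_curve(points):
    """Trapezoidal area below the curve, with alpha on the horizontal axis."""
    pts = sorted(points, key=lambda p: (p[0], -p[1]))
    return sum((a1 - a0) * (b0 + b1) / 2.0
               for (a0, b0), (a1, b1) in zip(pts, pts[1:]))


# Hypothetical toy data (not from the paper): three insolvent, five solvent.
probs = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.6]
labels = [True, True, True, False, False, False, False, False]
curve = error_curve(probs, labels, range(0, 101))
area = area_below_curve(curve)
```

A smaller area indicates a better classifier, since both error rates should be
low at the same time.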

To get a fair comparison for the performance of MoE-31 we trained a series of
single multi-layer perceptrons with the same data sets (training set and test
set) and the same feature vector. The best multi-layer perceptron achieves
a β-error of 38.5% when the α-error is fixed at 8.7%. The area below the
α-β-error curve for that net is 12.31%, also measured on the validation
set. The performance of this benchmark network shows that the
mixture-of-experts topology performs clearly better than a single multi-layer
perceptron does.



5 Analysing MoE-31 

As mentioned above, a mixture-of-experts network allocates its different
expert networks to different subtasks. To verify whether such a specialisation
of the expert networks actually took place in MoE-31, we plotted the weights
g_j the gating network assigns to the outputs y_j of each expert network j
(when the training sample is presented) against one another. Hence each
dot in the resulting scatter plot (given in figure 3) represents one vector of
the training sample.




Figure 3: Scatter plot of the weights the gating network assigns to the 
outputs of the three expert networks. 

Figure 3 shows that there are data sets where one expert network dominates
the output (g_j > 0.5) whereas the other two expert networks are only of
minor importance for the output of MoE-31. In particular, the weights of
expert network 1 and expert network 3 are negatively correlated. This means
that there are companies for which expert network 1 delivers the best results
and companies for which expert network 3 is the better judge.

In order to answer the question of which subtasks MoE-31 has identified,
we used the data of the training set to analyse the correlation between
the 31 elements of the feature vector and the weights g_j. This correlation
analysis revealed that the strongest correlation exists between the size of a
company, measured in terms of class of turnover, and the choice of the expert
network expressed by g_j. The Spearman correlation coefficient between the
class of turnover and g_j was -0.915 for the first expert network (j = 1),
-0.845 for the second (j = 2) and 0.92 for the third expert network (j = 3).
This means that the bigger a company is, the less responsible the first and
second expert networks are for the output of MoE-31 and the more responsible
the third expert network is for the measurement of the a posteriori
probability of insolvency of this company. Since we used ratios of components
of financial statements and not the components as absolute numbers, this
result shows that there are significant differences between the financial
ratios of small, middle-sized and big companies which are independent of the
probability of insolvency of these companies. This finding is backed up with
regard to capital structure ratios by a recent study of the European Union
that deals with the corporate financing of European companies (Delbreil et
al. (1997)). It is a result of this study that companies of different size
have different capital structure ratios. Thus the effect of the size of a
company on its financial ratios must be neutralized if the probability
of insolvency is to be measured accurately.
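
A rank correlation of this kind can be sketched as below. The helper names and
the numbers are hypothetical and serve only as an illustration; ties in the
class of turnover are handled with average ranks, which is one common
convention and not necessarily the one used in the study.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Product-moment correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical gating weights for eight companies (not data from the paper).
turnover_class = [1, 1, 2, 2, 3, 3, 4, 4]
g1 = [0.70, 0.65, 0.50, 0.55, 0.30, 0.20, 0.10, 0.05]
g3 = [0.05, 0.10, 0.20, 0.15, 0.40, 0.55, 0.70, 0.80]
rho_1 = spearman(turnover_class, g1)   # strongly negative
rho_3 = spearman(turnover_class, g3)   # strongly positive
```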



Abstract: The aim of this paper is to compare different methods of forecasting
exchange rates using methods of multivariate data analysis. Traditional
forecasting models like the random walk, prominent structural models and
approaches using the forward rate to forecast the spot rate are evaluated.
Time series analysis is conducted employing univariate time series models as
well as multivariate time series models and error correction models. Model
identification, estimation and forecasting are exemplified using the
DM/US-Dollar exchange rate. Forecasting performance is measured by different
criteria; within the scope of the investigation, methods of multivariate data
analysis are efficiently employed.



1 Introduction 

A couple of papers (e.g. Cornell (1977), Frenkel (1976) and Mussa (1979))
come to the unsatisfying conclusion that flexible exchange rates follow a
random walk. If this were true, economic theory could not contribute any
value to making forecasts of exchange rates. For this reason this paper will
first provide an empirical test of the random walk hypothesis. The rejection
of the random walk hypothesis is a necessary condition for developing useful
forecasting models (see Heri (1982), p. 157).

From an economic viewpoint it seems quite plausible that bilateral exchange
rates, though highly volatile, do not stray too far from the fundamental
background of the corresponding economies. The aim of the structural
models is the explanation of exchange rate changes caused by the change of
fundamental variables. As it is beyond the scope of this paper to review all
of the innumerable structural models, we follow the standard approach of
Meese and Rogoff (1983) and investigate only three basic models, which are
the monetarist Chicago model, the Keynesian model and the Hooper-Morton
model.

Along with the structural models there were attempts to use the forward 
exchange rate to help forecast the spot rate in the seventies and early eight- 
ies. In this paper we use the model of Frenkel (1976) as well as the forward 
rate itself to predict the future spot rate. 

Useful macroeconomic approaches for quantifying the dynamics of
international currency exchanges are missing. Time series analysis offers a
solution to this problem. Multivariate time series analysis in particular not
only models each variable's dynamics, but also captures the interactions
between the variables over time. The time series approaches investigated in
this paper are the autoregressive model, the moving average model and the
combined ARMA model. When applying these models to the data under
investigation we take a slightly different approach than the standard
literature (e.g. Gerhards (1994)). Instead of trying to improve the structural
models, we apply those techniques to the models using the forward rate to
forecast the spot rate. As a prerequisite it has to be ensured that the spot
and forward rate differ significantly. Only if this is the case does the
forward rate contain information that is not yet included in the spot rate.
However, the corresponding vector ARMA model, like the univariate time
series models, can only be applied if the modelled variables are stationary.
This is usually the case for differences but not for levels of economic
variables. This problem can be dealt with by further improving the simple time
series models using so-called error correction models. This type of model
offers the advantage that not only short-term dynamics but also long-term
equilibrium can be modelled (e.g. Engle and Granger (1987)).



2 Database and Model Estimation 

Model identification and estimation are done using the DM/US-Dollar
exchange rate for the period between October 11, 1983 and May 10, 1988. For
the investigation of the random walk properties the DM/US-Dollar exchange
rate between January 8, 1974 and June 25, 1996 is used. To confirm the
hypothesis that forward and spot rate differ significantly, weekly data from
October 11, 1983 to June 25, 1996 are used in the analysis. The data were
provided by Datastream Ltd., London. The forward rate used in the
investigation is the 3-month future exchange rate. Correspondingly the 3-month
Euro-money market interest rate was chosen. The money aggregates
employed are M3. As proxies for inflation, the Consumer Price Index was used
for the USA and the “Preisindex für die Lebenshaltung” for Germany.
A detailed description of model identification and estimation can be found
in Bankhofer and Rennhak (1997, pp. 4-20). The following list shows
the main results:

(1) The random walk hypothesis can be rejected at the 5 percent level. 
Since the random walk model is usually used as a benchmark, it is 
included in the empirical analysis for comparison. 

(2) When estimating the structural models and Frenkel's model an
autocorrelation problem becomes evident. The apparent misspecification
of these models is dealt with following a suggestion of Granger and
Newbold (1974, p. 117) by reestimating them in differences.

(3) The difference of spot and forward rate is significant at the 5 percent
level. Therefore the forward rate contains information not yet included
in the spot rate.

(4) When employing univariate time series models the logarithm of the
spot rate is used; when employing multivariate time series models the
logarithm of the forward rate is used additionally. At the 5 percent
level stationarity of these time series cannot be confirmed. Only after
taking the first differences of these series is this problem solved.

(5) In the course of time series model identification an AR(1), a VMA(1),
a VAR(1) and a VAR(4) model seem adequate, as for those models
the null hypothesis of model adequacy cannot be rejected.

(6) To identify an error correction model this paper follows Cifarelli (1992)
and Clarida and Taylor (1992), i.e. we are aiming at a cointegration
relationship between spot and forward rate. At the 5 percent level a
cointegration relationship can be confirmed for the data at hand. The
estimation of the error correction model is done using least squares
(cf. Engle and Granger (1987)). The model parameters are significant, and
an autocorrelation problem does not appear.
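
The least squares estimation of an error correction model mentioned in (6)
can be illustrated by a two-step procedure in the spirit of Engle and Granger
(1987). The following is a minimal sketch on synthetic data and not the model
estimated in this paper: the series, the adjustment speed of 0.5 and the
omission of lagged differences are simplifying assumptions made for the
example.

```python
import random


def ols(x, y):
    """Least squares fit of y = a + b * x; returns (a, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    return mean_y - b * mean_x, b


def engle_granger(spot, forward):
    """Two-step estimate: cointegrating slope, then error correction speed."""
    # Step 1: long-run relation spot_t = a + beta * forward_t + u_t.
    a, beta = ols(forward, spot)
    u = [s - a - beta * f for s, f in zip(spot, forward)]
    # Step 2: short-run dynamics d_spot_t = c + gamma * u_{t-1} + e_t.
    d_spot = [s1 - s0 for s0, s1 in zip(spot, spot[1:])]
    _, gamma = ols(u[:-1], d_spot)
    return beta, gamma


# Synthetic series (an assumption for illustration): f is a random walk and
# s is pulled back towards f by half of last period's deviation.
rng = random.Random(0)
f, s = [0.0], [0.0]
for _ in range(500):
    gap = s[-1] - f[-1]                      # deviation from equilibrium
    f.append(f[-1] + rng.gauss(0.0, 1.0))
    s.append(s[-1] - 0.5 * gap + rng.gauss(0.0, 0.5))

beta, gamma = engle_granger(s, f)
```

A negative gamma indicates that deviations from the long-run relation are
corrected over time; in this synthetic setting the estimates should lie close
to beta = 1 and gamma = -0.5.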

The DM/US-Dollar exchange rate for the 98 weeks between May 17, 1988
and March 27, 1990 is forecast using the estimated models. Though the
model parameters are held constant during the course of the analysis, all
other information available at each forecast date is used. When employing
structural models the actually realized values of the independent variables
are used.



3 Criteria to Measure Forecasting Performance

To evaluate forecasting performance, we need a measure to compare the
forecasts of the different models. Granger and Newbold (1986) suggest the
“Mean Absolute Error”. It is given as follows:

    MAE = \frac{1}{n} \sum_{t=1}^{n} |S_t - \hat{S}_t|,    (1)

where n is the number of forecasts made, S_t is the actual spot rate and
\hat{S}_t is the forecast spot rate at time t. Meese and Rogoff (1983) use in
addition the “Root Mean Square Error”. It is given by

    RMSE = \sqrt{\frac{1}{n} \sum_{t=1}^{n} (S_t - \hat{S}_t)^2}.    (2)

MAE as well as RMSE grow monotonically with growing forecast error.
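
As a small illustration, both measures can be computed as follows. The function
names and the four exchange rate values are hypothetical and are not data from
this study.

```python
import math


def mae(actual, forecast):
    """Mean absolute error, equation (1)."""
    return sum(abs(s - f) for s, f in zip(actual, forecast)) / len(actual)


def rmse(actual, forecast):
    """Root mean square error, equation (2)."""
    squared = sum((s - f) ** 2 for s, f in zip(actual, forecast))
    return math.sqrt(squared / len(actual))


# Hypothetical DM/US-Dollar rates, chosen only for illustration.
actual = [1.70, 1.72, 1.69, 1.75]
forecast = [1.71, 1.70, 1.70, 1.71]
mae_value = mae(actual, forecast)     # about 0.02
rmse_value = rmse(actual, forecast)   # about 0.0235
```

Because the errors are squared before averaging, RMSE weights large
deviations more heavily than MAE and is never smaller than MAE.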

In addition to these concepts other performance measures are conceivable.
Especially in forecasting exchange rates it can be important whether the
future exchange rate is underestimated or overestimated. Therefore
corresponding measures can be useful. The average underestimation is given by

    MAE^{pos} = \frac{1}{n^{pos}} \sum_{t=1}^{n} E_t^{pos}
    with E_t^{pos} = S_t - \hat{S}_t if S_t - \hat{S}_t \geq 0, and 0 otherwise,    (3)

where n^{pos} is the number of positive forecasting errors. The average
overestimation, i.e. the average negative forecasting error, is given by



    MAE^{neg} = \frac{1}{n^{neg}} \sum_{t=1}^{n} E_t^{neg}
    with E_t^{neg} = \hat{S}_t - S_t if S_t - \hat{S}_t < 0, and 0 otherwise,    (4)

where n^{neg} is the number of negative forecasting errors. Furthermore, the
sample standard deviation






    STA = ...    (5)

can be computed. Models with smaller standard deviation are preferred.
Minimum and maximum differences between actual and forecast spot rate
might be used as further criteria. These measures are given by

    MIN = \min_t |S_t - \hat{S}_t|,    (6)

    MAX = \max_t |S_t - \hat{S}_t|.    (7)
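
The direction-dependent measures and the extreme deviations can be sketched
as below. Several points are assumptions of this sketch rather than statements
of the paper: overestimations are reported as magnitudes, STA is taken to be
the sample standard deviation of the forecasting errors with divisor n - 1,
and the data are hypothetical.

```python
import statistics


def error_measures(actual, forecast):
    """Signed-error measures for one model; keys mirror the criteria names."""
    errors = [s - f for s, f in zip(actual, forecast)]
    pos = [e for e in errors if e > 0]     # underestimation: actual above forecast
    neg = [-e for e in errors if e < 0]    # overestimation, kept as magnitudes
    return {
        "MAE_pos": sum(pos) / len(pos) if pos else 0.0,
        "MAE_neg": sum(neg) / len(neg) if neg else 0.0,
        "STA": statistics.stdev(errors),   # assumed definition, see lead-in
        "MIN": min(abs(e) for e in errors),
        "MAX": max(abs(e) for e in errors),
    }


# Hypothetical DM/US-Dollar rates, chosen only for illustration.
actual = [1.70, 1.72, 1.69, 1.75]
forecast = [1.71, 1.70, 1.70, 1.71]
measures = error_measures(actual, forecast)
```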

To measure the fit between actual exchange rates and forecast exchange
rates the coefficient of determination can be calculated. It is given by

    BEST = \frac{\left[ \sum_{t=1}^{n} (S_t - \bar{S}) (\hat{S}_t - \bar{\hat{S}}) \right]^2}
                {\sum_{t=1}^{n} (S_t - \bar{S})^2 \sum_{t=1}^{n} (\hat{S}_t - \bar{\hat{S}})^2}
    with \bar{S} = \frac{1}{n} \sum_{t=1}^{n} S_t, \bar{\hat{S}} = \frac{1}{n} \sum_{t=1}^{n} \hat{S}_t.    (8)

The larger the coefficient of determination, the better the actual exchange
rates are approximated by the forecast values.

With the following three criteria the number of “forecasts on target” is
measured. First, the relative number of exact hits can be determined. Rounding
to two decimals seems advisable. This measure is calculated as follows:

    TRGER = \frac{1}{n} \sum_{t=1}^{n} T_t
    with T_t = 1 if int[100 (S_t + 0.005)] = int[100 (\hat{S}_t + 0.005)], and 0 otherwise,    (9)
where int denotes the integer function. Moreover the relative number of hits
within a certain interval can be measured. In this paper the interval is
determined in such a way that there is a maximum deviation of half a “Pfennig”
and one “Pfennig” respectively between actual and forecast exchange rate.
The corresponding criteria are

    TR1 = \frac{1}{n} \sum_{t=1}^{n} T_t
    with T_t = 1 if |S_t - \hat{S}_t| < 0.005, and 0 otherwise,    (10)

    TR2 = \frac{1}{n} \sum_{t=1}^{n} T_t
    with T_t = 1 if |S_t - \hat{S}_t| < 0.01, and 0 otherwise,    (11)

where TR1 gives the relative number of hits given a maximum deviation of
half a “Pfennig” and TR2 the relative number of hits given a maximum
deviation of one “Pfennig”. The length of the symmetric interval determining
hits is therefore one “Pfennig” for TR1 and two “Pfennig” for TR2.
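
The coefficient of determination and the three hit criteria can be sketched as
follows. Function names and data are hypothetical; the hit conditions follow
equations (9) to (11) as given above.

```python
def best(actual, forecast):
    """Coefficient of determination (8): squared correlation of S and S-hat."""
    n = len(actual)
    mean_s, mean_f = sum(actual) / n, sum(forecast) / n
    cross = sum((s - mean_s) * (f - mean_f) for s, f in zip(actual, forecast))
    ss = sum((s - mean_s) ** 2 for s in actual)
    sf = sum((f - mean_f) ** 2 for f in forecast)
    return cross ** 2 / (ss * sf)


def hit_rate(actual, forecast, is_hit):
    """Relative number of forecasts for which is_hit(actual, forecast) holds."""
    return sum(1 for s, f in zip(actual, forecast) if is_hit(s, f)) / len(actual)


def same_after_rounding(s, f):
    """Exact hit after rounding both rates to two decimals, as in (9)."""
    return int(100 * (s + 0.005)) == int(100 * (f + 0.005))


# Hypothetical DM/US-Dollar rates, chosen only for illustration.
actual = [1.7012, 1.7149, 1.6951, 1.7500]
forecast = [1.7040, 1.7101, 1.7049, 1.7350]

r2 = best(actual, forecast)
trger = hit_rate(actual, forecast, same_after_rounding)
tr1 = hit_rate(actual, forecast, lambda s, f: abs(s - f) < 0.005)
tr2 = hit_rate(actual, forecast, lambda s, f: abs(s - f) < 0.01)
```

In this toy example the third forecast counts as an exact hit after rounding
although it deviates by almost one “Pfennig”, which shows why TRGER and TR1
can rank models differently.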



4 Comparison of the Forecasting 
Performance 

The criteria introduced in the section above can now be computed for the
models analysed in this study. A detailed analysis of the models' rankings
with respect to the criteria under consideration leads to the profiles shown
in figures 1 and 2. For clarity, traditional and time series models
are represented separately, though the ranks refer to the total number of
models.




Figure 1: Profiles of the traditional forecasting models

When comparing the two figures it becomes obvious that the time series
models generally have higher rankings than the traditional models. In
particular, the VAR(4) model and the error correction model show fairly good
results over almost all criteria. Only TR2 is an exception, as the error
correction model has the lowest rank measured by this criterion. When
considering the traditional models separately, it is striking that the random
walk has the highest ranking measured by seven of the criteria. Only for
MAE^^®, STA and BEST does the Hooper-Morton model, and for TR2 does
Frenkel's model, have the highest rank within the group of traditional
models. The overall poor results of models based on the forward
rate become apparent when referring to all criteria considered. There is no
evidence of any superior forecasting performance of these models, as is
sometimes claimed in other studies.




Figure 2: Profiles of the time series based forecasting models 



In addition to the analysis of the rankings, a closer look should be
taken at the absolute values the different models achieve on the
different criteria. For this reason each criterion is standardized to the
[0; 1] interval, where 0 is worst and 1 is best.¹ Subsequently a visualisation
by star plots is provided in figure 3. A relatively big star stands for a
relatively good forecasting performance of the model under consideration over
all criteria. Apart from the models based on the forward rate, the differences
between the models are only slight. The error correction model is marked with
the largest star, though it is worst when measured by TR2. The star that is
most evenly large over all criteria is that of the VAR(4) model.



¹ Since a smaller value is preferred for the criteria MAE, RMSE, MAE^pos,
MAE^neg, STA, MIN and MAX, the values transformed to [0; 1] were subtracted
from 1.

Figure 3: Star plots of the forecasting models
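
The standardization to the [0; 1] interval can be illustrated by a simple
min-max scaling. The text does not state which transformation was used, so
the scaling below, the function name and the handling of identical values are
assumptions of this sketch.

```python
def standardize(values, smaller_is_better=False):
    """Scale one criterion over all models to [0, 1], with 1 being best."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All models tie on this criterion (convention chosen for the sketch).
        return [1.0] * len(values)
    scaled = [(v - lo) / (hi - lo) for v in values]
    # For error-type criteria a smaller raw value is better, so the scaled
    # value is subtracted from 1.
    return [1.0 - z for z in scaled] if smaller_is_better else scaled


# Hypothetical criterion values for three models.
mae_scores = standardize([0.02, 0.03, 0.04], smaller_is_better=True)
tr1_scores = standardize([0.50, 0.75, 1.00])
```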

A classification of the forecasting models based on the eleven criteria leads
to the dendrogram of figure 4. As the complete linkage method requires a
distance matrix, the city block metric was applied. To arrive at a sensible
weighting of the distances, the criteria were grouped into the blocks
MAE/RMSE, MAE^pos/MAE^neg, STA, MIN/MAX, BEST and TRGER/TR1/TR2. The weights
were set such that the distances within as well as between the six blocks
contribute evenly to the aggregated distance index.
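
A weighted city block distance combined with complete linkage can be sketched
as follows. The weighting scheme shown (every block receives the same total
weight, split evenly among its criteria) is one possible reading of the
description above, and the models A to D with their scores are hypothetical.

```python
def block_weights(block_sizes):
    """One weight per criterion; every block carries the same total weight."""
    n_blocks = len(block_sizes)
    return [1.0 / (n_blocks * size) for size in block_sizes for _ in range(size)]


def city_block(a, b, weights):
    """Weighted city block (L1) distance between two criterion vectors."""
    return sum(w * abs(x - y) for w, x, y in zip(weights, a, b))


def complete_linkage(points, weights):
    """Agglomerative clustering; returns the merges as (name, name, distance)."""
    clusters = {name: [name] for name in points}
    merges = []
    while len(clusters) > 1:
        best = None
        names = sorted(clusters)
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                # Complete linkage: distance of the two farthest members.
                d = max(city_block(points[p], points[q], weights)
                        for p in clusters[a] for q in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[a + "+" + b] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
    return merges


# Hypothetical standardized scores on three criteria in two blocks (2 + 1).
weights = block_weights([2, 1])
points = {
    "A": [0.9, 0.8, 0.7],
    "B": [0.8, 0.9, 0.7],
    "C": [0.1, 0.2, 0.0],
    "D": [0.2, 0.0, 0.1],
}
merges = complete_linkage(points, weights)
```

The merge history corresponds to reading a dendrogram from the bottom up: the
most similar models are joined first, at the smallest distance.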



[Dendrogram; leaves in plotted order: Chicago, Keynes, AR(1), VAR(1),
random walk, VAR(4), Hooper-Morton, ECM, VMA(1), Frenkel, forward rate]

Figure 4: Complete linkage classification of the forecasting models



The similarities of the different forecasting methods concerning forecasting
performance show up very clearly in the dendrogram at hand. The
most similar ones are the Chicago model and the Keynesian model. The
models based on the forward rate differ significantly from the other models
in terms of forecasting performance and form a class of their own. These
results confirm our previous findings.

In order to allow a detailed analysis of the similarities between the models
concerning their forecasting performance, a factor analysis was carried out.
The result of a two-dimensional representation is shown in figure 5. Almost
87 percent of the original information is preserved. In the diagram the
position of the forecasting models can be interpreted with regard to the axes.
It becomes obvious, for example, that Frenkel's model shows relatively high
values for TR2, STA, RMSE, MAE, MIN, MAX and MAE^^" and
relatively low values for BEST, TRGER and TR1 and therefore a poor
overall forecasting performance.



[Scatter plot of the eleven models; horizontal axis: first factor (66.2 %).
Legend: 1 random walk, 2 Chicago, 3 Keynes, 4 Hooper-Morton, 5 Frenkel,
6 forward rate, 7 AR(1), 8 VMA(1), 9 VAR(1), 10 VAR(4), 11 ECM]

Figure 5: Two-dimensional representation of the forecasting models



5 Conclusions 

Due to the scope of this paper we were repeatedly forced to operate within
certain limits. First of all, we investigated only a limited number of
approaches to exchange rate forecasting. We did not deal with chart
techniques, portfolio approaches, Markov-switching and exchange rate
substitution. The time series models were limited to linear processes. We did
not analyse ARCH models. Furthermore only one exchange rate, the
DM/US-Dollar exchange rate, was investigated. There is a fair chance that
analysing different exchange rates would have led us to different conclusions.
A further limitation is given by the time frame dealt with in this paper. It
spans a period of roughly two years. The choice was determined by data
availability. Nevertheless, we have to emphasize at this point that the
outcome of a comparison based on 98 forecasts cannot be determined by chance
alone.

In summary, it has to be admitted that we can only slightly outperform the
random walk. It does not seem to make much sense to forecast exchange
rates using structural models and models using the forward rate. Only the
multivariate time series models perform better than the random walk. Thus
it can no longer be unconditionally taken for granted that exchange rates
follow a random walk.



Abstract: The unprecedented potential of intelligent software agents for
reducing work and information overload offers significantly decreased
transaction costs in internet-based business throughout the whole value chain,
in particular on the part of the end customers. Changed cost structures will
transform intermediation and enable new market forms. The current state of
agent technology and its application in the design of retail markets is
explored and categorized. The ability of agents to negotiate on behalf of
their owners is shown to be of crucial importance. The economic theory of
transaction costs is applied to question widely held beliefs about the future
development and organization of electronic markets in the light of the new
technology.



1 Introduction 

The internet provides an exponentially growing infrastructure for business. 
By the end of 1996, 80% of America’s Fortune 500 firms had a web site 
(Economist (1997), p. 4), compared to only 34% a year earlier, and pri- 
vate usage shows similar growth patterns. By all evidence, the internet is 
becoming the next mass medium, with huge potential not only for business- 
to-business transactions, but also for retail business. The most appealing 
feature of the new medium is that it enables new modes of exchange, which 
allow for significantly reduced costs for business transactions. The adoption 
of internet-enabled technologies is changing existing markets and creating 
new, electronic ones. The change is driven by the economic impact of these
technologies, but the economic incentives shape the technological
development as well. Understanding these interacting factors is crucial for
anticipating coming developments and identifying business opportunities.

A new software technology of intelligent software agents is just emerging.
Intelligent software agents are designed to relieve their users of
computer-related tasks in much the same way as a human assistant would. First
experiences with prototypes are encouraging, and agents have already been said
to “be the most important computing paradigm in the next ten years” (Gilbert
(1997)). Properly designed software agents would raise the communicative
and information processing capabilities of their owners to an unprecedented,
qualitatively new level. Applied to electronic marketplaces, software agents
could be the breakthrough towards the economic ideal of perfect markets.

The following analysis classifies the current prototype technology related
to electronic markets and considers its potential impact on the structure
of electronic commerce from an economic point of view, with polypolistic
retail markets in mind. Section 2 proposes a simple but useful descriptive
model of commerce to review the current practice in electronic retail
business. Section 3 briefly reviews the notion of intelligent software agents.
The previously developed model of commerce is applied to classify
representative software agents according to the benefits they bring to their
user. Section 4 concludes with considerations on the potential impact on the
future organization of electronic retail markets. Arguments based on the
economic theory of transaction costs show that widely held beliefs about the
likely development of electronic retail markets might soon become obsolete.



2 Modelling Commerce and the Current 
Practice of Internet Retail Business 

To gain a better understanding of electronic commerce it is useful to consider
its various aspects within the common framework of a descriptive model.
Any business transaction can be dissected into successive phases, like
evaluation, contracting and posttrading. Each phase involves certain tasks to
be performed. These tasks will usually be different for buyers and sellers, but
most of them will require some degree of mutual interaction. These
interactions are crucial, and may involve flows of information, goods or
payments. The phases, tasks and interactions to be included in such a model
are clearly goal-dependent. This paper focuses on software agents to support
transactions between individual buyers and sellers. Such a partial analysis
will not cover the possibility of coalition formation between market
participants, although this possibility is likely to become reality as soon as
the technology underlying electronic markets reaches the level of
sophistication necessary to support coalitions efficiently, an issue that will
certainly challenge business strategy and legislation alike in the near future.

To describe the currently available software agent prototypes, it is sufficient 
to distinguish just six different phases for any successful business transaction. 
Figure 1 depicts the resulting model, which is suitable to describe non-agent 
based transactions in both internet and traditional retailing as well. 

The model is derived from Nissen’s (1996) integrated commerce model. Its
successive phases represent an approximation and simplification of complex
behavior. The phases may overlap and the migration from one to another can
be iterative. Future technological improvements are likely to call for more
detailed phase schemes for proper classification. For example, the searches
for the right product and the right supplier usually depend on each other to
some degree, but current technology offers very limited support for solving
the two problems jointly. In any case, the basic structure of the model will
remain.

In all phases, information is essential. The costs associated with overcoming
informational problems take up a major part of the overall costs of products
and services incurred by the end customer. Proper IT support not
only reduces these costs significantly; it also changes the cost structure,
and thereby the structure and the evolution of internet business, which will
differ more and more from its traditional counterpart, as will become more
apparent in Section 4. Intelligent software agents are likely to be a key
technology promoting this change.

Figure 1: Descriptive model of commerce

Although currently widely used technologies like web sites, browsers,
electronic catalogs, email, search engines and the like (Beam and Segev (1996))
have enabled many successful migrations of businesses to the internet (Morgan
Stanley (1997)), they are essentially restricted to speeding up traditional
modes of interaction. In fact, current electronic markets (Mertens and
Schumann (1996)) based on these technologies closely resemble their
traditional counterparts in many respects. Suppliers are represented by web
sites which may include electronic catalogs, customers by web browsers or more
advanced search tools, and the net connects both sites of specific hard- and
software much like roads connect shops and homes. Customers may use
electronic catalogs to identify their needs and to find the right product. To
find the right supplier, they still need to do comparison shopping. Usually,
they will search various web sites of potential suppliers, where they need to
collect all relevant information like prices, delivery times, warranties and so
on. That is, customers and merchants are still left with much of the
burden involved in doing their business. Intelligent software agents might
relieve their users of these tasks. The resulting leverage in their
information processing capabilities is likely to have a profound impact on
business practice and the actual operation of markets.



3 Intelligent Software Agents in Electronic
Markets

Intelligent software agents are generally considered to be software that assists 
people and acts on their behalf (Gilbert (1997)), intended to serve a user 
much like a (human) personal assistant who is collaborating with the user 
in the same work environment (Maes (1994)). 

Unfortunately, no generally accepted definition of intelligent software agents
has been developed so far that could be used to divide the world into agents
and non-agents. Franklin and Graesser (1996) provide a concise review of
proposed definitions from various scientific disciplines in an attempt to
distinguish agents from ordinary programs, and argue convincingly that fuzzy
categories are most appropriate to characterize agents. Additionally, a host
of often high-dimensional classifications for software agents have been
developed for different purposes (e.g. Brenner et al. (1997), Maes (1994)).
This unsettled state calls for a pragmatic approach. Software agents for
application in electronic markets are defined here as software that is capable
of supporting its users in computer-oriented tasks and intelligent in the sense
that it is able to act to some degree autonomously and in a goal-oriented way,
but is no more than an advanced tool for information processing. It is assumed
that the user is primarily concerned with the tasks agents support and not with
the technological questions involved. The following classification of agents is
therefore based on the commerce model of section 2, but the type and source
of their “intelligence”, the number of agents involved and the environment in
which they are situated are also of practical interest. Most important is the
paradigm shift in the way complex problems are solved: If one compares an
ordinary program with a mechanical tool like a hammer, one might compare
a software agent with a craftsman, who uses ordinary software tools to solve
the various tasks involved in the user-specified problem. The user “just”
has to communicate the problem to his agent, and will be relieved of the
sometimes quite demanding direct interaction with the necessary
applications, as illustrated in figure 2. The overall picture could be
considerably more complex: There might be a hierarchy of agents, working with
a host of specialized applications. The following examples of actually
implemented software agents will make this approach clearer.

Agents like PersonaLogic (http://www.personalogic.com) and FireFly (http: 
//www.firefly.com) help a consumer to locate an appropriate product.
PersonaLogic characterizes products in a high-dimensional feature space.
The user specifies his needs, and the system helps him to filter out the
products which do not meet the given constraints. This service is offered
by merchants to their customers to enable them to locate the most
suitable product in the merchant’s catalog. FireFly takes a quite different
approach. It recommends products via a “word of mouth” recommendation
mechanism, which utilizes the preferences the user reveals while
shopping. His preferences are compared with those of others, and the system
recommends to him products which have been rated favorably by users with
similar preferences. In doing so, FireFly not only helps to find the right
product, but also supports the user in identifying a need in the first place.
The system is currently used to recommend products such as music or books.
Agents like BargainFinder (http://bf.cstar.ac.com/bf) and Jango
(http://www.jango.com) primarily help to find the right vendor.
BargainFinder does so by comparing the prices from a predefined set of
vendors. To get the information, it places requests via HTTP in the same
manner as the user could himself with a web browser. Interestingly, many
vendors blocked price requests from BargainFinder, since they do not want
to compete on price alone. Indeed, simple price comparisons might very well
be unsatisfactory for the user, who might be interested in the value-added
services these merchants offer at their web sites. Jango works similarly,
but is more advanced. It bypasses the blocking problem by issuing the
request from the user’s web browser, so that the vendor is unable to
recognize that Jango is actually at work. Additionally, Jango helps to find
the right product, as it allows the user to search for a product category,
e.g. “modem”, resulting in a list of products with more detailed
information to choose from.
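
The “word of mouth” mechanism described above for FireFly can be illustrated with a small sketch of user-based collaborative filtering. This is only an illustration under stated assumptions, not FireFly’s actual algorithm: the cosine similarity, the threshold `min_rating` and all function names are hypothetical choices made for this example.

```python
import math


def similarity(a, b):
    """Cosine similarity of two rating dicts over the items both users rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in common))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in common))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(target, others, min_rating=4, top_n=3):
    """Rank items the target user has not rated yet.

    Each favorable rating by another user counts with a weight equal to
    that user's similarity to the target user (assumed scoring rule).
    """
    scores = {}
    for ratings in others:
        weight = similarity(target, ratings)
        if weight <= 0:
            continue
        for item, rating in ratings.items():
            if item not in target and rating >= min_rating:
                scores[item] = scores.get(item, 0.0) + weight * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

A user who likes the same records as the target user therefore contributes more strongly to the recommendation than a user with opposite tastes.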

The following agent based systems go much further in automating the
business process. All of them are capable of automating the negotiation
phase. Kasbah (http://kasbah.media.mit.edu) is a multiagent system that
implements a kind of closed “intelligent classified ad” market (Chavez and
Maes (1996)). Users who wish either to buy or to sell create a buying or
selling agent for the specified product, and equip the agent with relevant
information, notably a negotiation strategy. The allowable strategies are
very simple. Each user has to set his preferred price, his reservation
price and a time decay function. This function specifies how the price at
which the agent is willing to deal changes over time, from the preferred
price at the beginning to the reservation price at the end of the period
the user is willing to wait. The deal is either concluded at a price no
worse than the reservation price or canceled. Despite its simplicity, the
strategy has shown promising user satisfaction in field experiments.
Interestingly, more sophisticated strategies would not be trusted by many
users if they felt a loss of control. While Kasbah is still restricted to
price negotiations, the very recent Tete-a-Tete system
(http://ecommerce.media.mit.edu/Tete-a-Tete) allows for multiattribute
negotiations, that is, value-added services like warranties or return
policies can be negotiated and included in comparing offers.
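
The time decay strategy described above can be sketched in a few lines. The sketch assumes a power-law decay with a hypothetical `exponent` parameter (1.0 gives a linear decay); the function names and the rule for when a deal is possible are illustrative assumptions and are not taken from the Kasbah implementation.

```python
def offer_price(preferred, reservation, elapsed, deadline, exponent=1.0):
    """Price an agent is willing to deal at after `elapsed` time units.

    The price moves from `preferred` (at time 0) to `reservation`
    (at `deadline`). For a seller preferred > reservation, for a buyer
    preferred < reservation; the same formula covers both cases.
    """
    if deadline <= 0:
        raise ValueError("deadline must be positive")
    fraction = min(max(elapsed / deadline, 0.0), 1.0)
    return preferred + (reservation - preferred) * fraction ** exponent


def deal_possible(seller_ask, buyer_bid):
    """Assumed matching rule: a deal is possible once the bid reaches the ask."""
    return buyer_bid >= seller_ask
```

With a seller moving from 100 to 60 and a buyer moving from 50 to 90 over the same ten periods, both linear schedules meet halfway through the period at a price between 70 and 80.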

Auctions are another institutional setting for price determination in closed
markets. They have proven effective and easy to implement, particularly for
highly standardized or rare products. Stock exchanges or art auctions are
prominent traditional examples. Electronic auctions are also well
established and quite successful (there are more than 95 active internet
auctions, see
http://www.yahoo.com/Business_and_Economy/Companies/Auctions),
but participation is still time consuming and therefore costly. AuctionBot
(http://auction.eecs.umich.edu) is a newly developed auction system which
allows buyers to create an agent to bid on their behalf. The user is able to
equip his agent with an appropriate bidding strategy. Furthermore, sellers
can create their own auctions to sell their goods, which offers great
flexibility and control.

Figure 3 summarizes the properties of the agents discussed so far within
the context of the commerce model, based on unpublished work of Guttman
and Maes. It is apparent that current agents primarily aim to support
the customer in the search and negotiation phases of the business process.
They offer hardly any support to identify needs, to buy and pay, or in the
posttrading phase. It is also important to note that commercial vendors have
found little agent support at all. They may employ current agent technology
to provide a more convenient shopping experience to their customers, but it
seems that they do not find much support to actively search for customers
or to anticipate their needs.

Figure 3: Buyer support provided by selected intelligent software agents



The task-oriented classification of existing software agents demonstrates the 
potential of this emerging technology and identifies areas for promising fu- 
ture developments and applications. The next section considers the poten- 
tial impact of agent technology on the organization and design of electronic 
markets and intermediaries therein from an economic point of view. 



4 Conclusion: Potential Impact on Electronic 
Markets and Intermediaries 

The infrastructure and technologies available for market transactions
determine the appropriate organizational form of the market, which is
considered as a general means of exchange. Transaction cost analysis has
provided valuable insights into the structure and evolution of traditional
retail markets, and is applicable to electronic markets alike.
Intermediaries like merchants will emerge and survive in real-world
imperfect markets if they succeed in reducing the overall transaction costs
to a level below the sum of the transaction costs incurred by both market
partners (like producer and final customer) in unmediated interaction
(Picot (1986)), or in generating new business. Software agents can be
powerful mediators. Their application will change the cost structure
significantly and in consequence business practice itself, whereas
technologies with similar impact are not known for traditional markets. To
understand the migration from traditional to electronic business, it is
particularly important to consider the transaction costs of the retail
customer more carefully. Although many published numbers report reduced
transaction costs on the part of the vendor, the potential savings on
the part of the retail customer are usually considered to be no more than a
matter of “convenience.” Such a one-sided approach may result in severe
misjudgements.

The costs incurred in the search phases of the business process by the cus- 
tomer are usually significant. This is true for the direct costs involved as 
well as for the indirect opportunity costs that need to be taken into account. 
If the customer is unable to locate the best product or vendor and settles 
for a less than optimal solution, the resulting loss in utility corresponds to a 
cost. Negotiations are also not common in most retail markets, presumably 
because they are time consuming and inconvenient. Automating the ne-
gotiation process would avoid the prohibitive costs of negotiating in person,
and allow for more flexible dealing compared to the traditional take-it-or- 
leave-it quotation of terms to the benefit of customers and vendors alike. 
These benefits are unattainable using traditional practice, which again cor- 
responds to transaction costs. 

Successful designs in traditional retailing like shopping malls or speciality 
retailers made shopping easier for customers, given traditional modes of ex- 
change. These well known designs have been adopted to a large extent and 
still dominate the evolution of electronic commerce, as noted in section 2. 
The widely held belief that electronic commerce will continue along
traditional lines, offering low entry costs and new mail order
opportunities, seems questionable if the qualitative changes in the
transaction cost structures are properly taken into account. The enlarged
information processing capabilities of customers due to the advent of agent
technology are likely to make obsolete those business designs which have
been developed for traditional marketplaces, in which distance, and
therefore the distribution of market participants in space and the physical
infrastructure connecting them, is of prime importance. Instead, agent
based markets may come closer to the
economic ideal of perfect point markets (Reus (1998)), where automated 
negotiations will allow for flexible price determination for many products 
(Beam and Segev (1997)). Many functions performed by vendors will be 
performed using agent technology, calling for new business strategies. 

On the other hand, customers first need to get familiar with the new
technology, that is, they have to incur significant costs to learn how to
use the software enabling the electronic markets and how to integrate it
into the existing environment. This software will be the market from the
user’s point of view, and all evidence from the software markets suggests
that people are usually very reluctant to accept the sunk costs involved in
major system changes. This inertia gives rise to increasing entry barriers
for market making software and to fierce future competition to set
standards.

Proper coordination in electronic markets will challenge not only
technology, but legislation and business strategy alike. This is the
traditional domain of economic sciences, whose principles apply even better
to electronic market systems than to their traditional counterparts. The
future design of such artificial “economies” and of the software agents
acting in them needs to be guided by economic principles to ensure
satisfactory operation and trust in the system, as first fruitful
applications demonstrate.



Abstract: This paper describes the selection of risk measures from a set of
statistical and fundamental risk measures. Two concepts of risk are
analyzed: volatility of returns and sensitivity of returns. Stocks are
grouped with the k-centroids method (ISODATA). A statistical verification of
this method is presented. The paper illustrates the use of the chosen risk
measures in analyzing stocks listed on the Warsaw Stock Exchange.



1 Introduction 

With respect to its effects, risk may be understood in the following ways:

1. The dictionary defines risk as exposure to the chance of injury, damage
or loss. Intuitively, risk is the chance of something bad happening.
A better outcome than expected is not a risk.

2. On the other hand, risk is also associated with the possible
occurrence of good outcomes. Essentially, risk is something uncertain;
it can be a chance or a threat.

These two basic meanings of risk are reflected in different statistical
measures of risk. We take a quantitative approach to risk. Within this
approach we consider two concepts of risk:

• volatility of returns,

• sensitivity of returns.

Most investment textbooks define investment risk as the volatility of
returns, measured by the standard deviation (or, equivalently, the variance)
of the return distribution. The biggest problem with the standard deviation
is that it discriminates against investments with volatility on the upside.
We assume risk-averse investors. Therefore, if we define risk to incorporate
both good and bad results, then our risk-reward evaluation of potential
investments penalizes investments that might give us positive surprises just
as it penalizes those investments for the possibility of giving us negative
surprises. All of this concern is moot if investment returns are
symmetrically distributed, particularly in the form of a normal
distribution. In young capital markets, where securities are characterized
by high volatility of returns, the quartile deviation (based on the first
and third quartiles) is often used as a risk measure. Based on the
volatility-of-returns concept, we have finally chosen two measures from this
wide set of statistical risk measures:

1. Standard deviation, because it is a classical risk measure.

2. Quartile deviation, because the Polish stock market is growing and
returns have high volatility.
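
The two volatility measures chosen above can be computed as in the following sketch. It assumes the common definition of the quartile deviation as half the distance between the third and the first quartile, which the text does not spell out, and it relies on the default quartile rule of Python’s `statistics.quantiles`; other quartile conventions give slightly different values. The function names are illustrative.

```python
import statistics


def standard_deviation(returns):
    """Sample standard deviation of a series of returns."""
    return statistics.stdev(returns)


def quartile_deviation(returns):
    """Half the interquartile range, (Q3 - Q1) / 2 (assumed definition)."""
    q1, _median, q3 = statistics.quantiles(returns, n=4)
    return (q3 - q1) / 2
```

Because the quartile deviation ignores the outer quarters of the sample, a single extreme return changes it far less than it changes the standard deviation.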

No single risk measure can work well in all circumstances. Therefore
alternatives must be considered not only in the light of how well they
describe the return distribution, but also in terms of the complexity that
they add to the analysis. For the second concept (sensitivity of returns) we
used the beta coefficient (Alexander et al. (1995), Haugen (1993)). Beta
measures the so-called market risk (systematic risk, nondiversifiable risk).
Beta is the slope term in a security’s market model (a simple regression
model):

$$r_i = \alpha_i + \beta_i r_M + \varepsilon_i \qquad (1)$$

where:

$r_i$ - denotes the return on security i for a given period;
$\alpha_i$ - intercept term;
$\beta_i$ - slope term;
$r_M$ - return on the market index for the same period;
$\varepsilon_i$ - random error term.

Why is it a sensitivity measure? Because the slope in the market model
measures the sensitivity of the security’s returns to the market index’s
returns. As market index we have chosen the WIG (Warsaw Stock Exchange
Index). Stocks with betas larger than one are more volatile than the market
index and are known as aggressive stocks. In contrast, stocks with betas
less than one are less volatile than the market index and are known as
defensive stocks.
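
The slope and intercept of the market model (1) can be estimated by ordinary least squares. The following sketch assumes two equally long lists of period returns; the function name and the use of plain least squares (without any further adjustment) are assumptions of this example.

```python
def market_model(security_returns, market_returns):
    """Least-squares estimates (alpha, beta) of r_i = alpha + beta * r_M + error."""
    n = len(security_returns)
    if n != len(market_returns) or n < 2:
        raise ValueError("need two return series of equal length >= 2")
    mean_r = sum(security_returns) / n
    mean_m = sum(market_returns) / n
    covariance = sum((r - mean_r) * (m - mean_m)
                     for r, m in zip(security_returns, market_returns))
    market_variance = sum((m - mean_m) ** 2 for m in market_returns)
    if market_variance == 0:
        raise ValueError("market returns must not be constant")
    beta = covariance / market_variance
    alpha = mean_r - beta * mean_m
    return alpha, beta


def is_aggressive(beta):
    """Aggressive stocks have beta above one, defensive stocks below one."""
    return beta > 1.0
```

For a security whose return is exactly 0.002 plus 1.5 times the index return, the estimate recovers beta = 1.5, so the stock would be classified as aggressive.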



As a fundamental risk measure we used the Z-score from Edward Altman’s
discriminant function (Altman (1968, 1983)). Altman used multiple
discriminant analysis to predict bad business risks. His objective was to
see how well financial ratios could be used to distinguish which firms would
go bankrupt during the period 1946-1965. The index Z of creditworthiness
is:

$$Z = 3.3\,\frac{EBIT}{A} + 1.0\,\frac{S}{A} + 0.6\,\frac{MV}{BVD} + 1.4\,\frac{RE}{A} + 1.2\,\frac{WC}{A} \qquad (2)$$

where EBIT is earnings before interest and taxes, A denotes total assets,
S total sales, MV market value, BVD book value of debt, RE retained
earnings, WC working capital.

This equation distinguished the bankrupt and nonbankrupt firms very well.
Of the former, 94% had a Z-score of less than 2.7 the year before they
went bankrupt. In contrast, 97% of the nonbankrupt firms had Z-scores above
this level.

In our analysis we use the Z-score as an aggregate of financial ratios.
Because the results for Z are only appropriate for US firms and we do not
want to predict bankruptcy, we do not need the exact range of Z’s values. A
low value of Z simply denotes high risk. This means that our random variable
Z is a destimulus of risk: the bigger its value, the lower the risk (that
is, the better the firm).

The problem considered in this paper is to test risk measurement not only
from a statistical point of view. To measure investment risk we take
fundamental variables into account, too.
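
The Z-score used here as a fundamental risk measure is a weighted sum of five balance sheet ratios. The sketch below uses the commonly quoted weights of Altman’s (1968) function (3.3, 1.0, 0.6, 1.4 and 1.2); the function and parameter names and the sample figures in the usage note are hypothetical.

```python
def altman_z(ebit, total_assets, sales, market_value, book_value_debt,
             retained_earnings, working_capital):
    """Altman Z-score as a weighted sum of five financial ratios."""
    if total_assets <= 0 or book_value_debt <= 0:
        raise ValueError("total assets and book value of debt must be positive")
    return (3.3 * ebit / total_assets
            + 1.0 * sales / total_assets
            + 0.6 * market_value / book_value_debt
            + 1.4 * retained_earnings / total_assets
            + 1.2 * working_capital / total_assets)
```

For a hypothetical firm with EBIT 10, total assets 100, sales 100, market value 50, book debt 50, retained earnings 20 and working capital 10, the five terms are 0.33, 1.00, 0.60, 0.28 and 0.12. In the classification only the ordering matters: a lower Z means higher risk.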



2 Clustering 

Before clustering, the data have to be standardized:

$$z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j} \qquad (3)$$

for i = 1, ..., n and j = 1, ..., m,
where:

$z_{ij}$ - denotes the standardized value of the i-th observation for the j-th variable;
$x_{ij}$ - value of the i-th observation for the j-th variable;
$\bar{x}_j$ - average value of the j-th variable;
$s_j$ - standard deviation of the j-th variable.

To discriminate stocks we have used the k-centroids method (Jajuga (1997)).
This method is based on minimization of the following function:

$$\sum_{k=1}^{K} \sum_{i \in C_k} d_{ik}^2 \qquad (4)$$

where:

K - denotes the number of classes;
$i \in C_k$ - means that the i-th observation belongs to the k-th class (k = 1, ..., K);
$d_{ik}$ - the distance between the i-th observation and the vector of means of the k-th class.

The solution of the minimization problem may be obtained through an iterative
algorithm. We assume that the number of classes is given.
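
The standardization step and the iterative algorithm can be sketched as follows. This is a plain k-means style iteration with squared Euclidean distances and user-supplied starting centroids; the paper does not state its starting rule or whether the standard deviation uses n or n - 1, so both choices here (n - 1, explicit starting centroids) are assumptions, as are all function names.

```python
import math


def standardize(rows):
    """Column-wise z-scores; assumes no column is constant."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
           for j in range(m)]
    return [[(r[j] - means[j]) / sds[j] for j in range(m)] for r in rows]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_centroids(rows, initial_centroids, max_iter=100):
    """Alternate between assigning observations and updating class means."""
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each observation joins its nearest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(row, centroids[k]))
            for row in rows
        ]
        if new_labels == labels:
            break  # no observation changed class
        labels = new_labels
        # Update step: each centroid becomes the mean of its class.
        for k in range(len(centroids)):
            members = [row for row, lab in zip(rows, labels) if lab == k]
            if members:  # an empty class keeps its previous centroid
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

Each pass cannot increase the criterion in (4), so the loop stops after finitely many steps, although the result may depend on the starting centroids.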

This method is well known, but we want to verify that the results of our
clustering are statistically significant.



3 Verification of the k-centroids Method

The sum of squared differences can be written as follows:

$$d_{ik}^2 = (z_{i1} - v_{k1})^2 + \ldots + (z_{im} - v_{km})^2 = z_i^T z_i \qquad (5)$$

where $v_{kj}$ is the mean of the j-th variable in the k-th class.

By assumption $z_i$ has a normal distribution with the following parameters:

$$z_i \sim N(0, \Sigma_i)$$

where $\Sigma_i$ denotes the covariance matrix.

The test with the empirical statistic for m degrees of freedom is formed as:

$$z_i^T \Sigma_i^{-1} z_i \sim \chi^2_m$$

where $z_i^T z_i \sim (g_i \cdot \chi^2_m)$ has a noncentral $\chi^2$ distribution.



4 Empirical Results 

Monthly returns of stocks listed on the Warsaw Stock Exchange are used in
this paper. The returns cover the period from January 1994 to January 1998.
To calculate the beta coefficient (apart from quality) the stock exchange
index WIG (Warsaw Stock Exchange Index) is chosen. We have divided the
securities into three groups characterized by high, medium and low risk. We
made two classifications: one with only quantitative risk measures and a
second one with quantitative and fundamental risk measures.

FIRST CLASSIFICATION - 43 Stocks

In this classification only quantitative risk measures are used: standard
deviation, quartile deviation, and beta coefficient.

The high risk class contains 17 securities:

BIG (Ba), BSK (Ba), Domplast (C), Exbud (Bu), Irena (C), Kable (E),
Kable BFK (E), Krakchemia (S), Krosno (C), Polifarb (C), Rafako (E),
Rolimpex (H), Sokolow (F), Swarzedz (Fu), Tonsil (E), Universal (H), WBK
(Ba);

with means: standard deviation s = 0.230,
            quartile deviation q = 0.084,
            beta coefficient β = 1.531.

The medium risk class contains 11 securities:

Agros (H), Elektrim (H), EsPeBePe (Bu), Mostostal Export (Bu), Mostostal
Warszawa (Bu), Mostostal Zabrze (Bu), Okocim (F), Prochem (Bu), Prochnik
(T), Remak (E), Stalexport (H);

with means: standard deviation s = 0.177,
            quartile deviation q = 0.128,
            beta coefficient β = 1.038.

The low risk class contains 15 securities:

Amerbank (Ba), BRE (Ba), Debica (C), Drosed (F), Efekt (S), Indykpol
(F), Jelfa (C), Kredyt Bank (Ba), Novita (T), Optimus (E), PP-AB (Ba),
Vistula (T), Wedel (F), Wolczanka (T), Zywiec (F);

with means: standard deviation s = 0.141,
            quartile deviation q = 0.053,
            beta coefficient β = 0.825.



SECOND CLASSIFICATION - 36 Stocks (Without Banks)

In this classification the quantitative risk measures (standard deviation,
quartile deviation, beta coefficient) are used together with the Z-score
from Altman’s discriminant function as a fundamental risk measure.

The high risk class contains 16 securities:

Domplast (C), Elektrim (H), EsPeBePe (Bu), Exbud (Bu), Kable (E),
Krakchemia (S), Krosno (C), Mostostal Warszawa (Bu), Prochnik (T), Rafako
(E), Remak (E), Rolimpex (H), Sokolow (F), Swarzedz (Fu), Tonsil (E),
Universal (H);

with means: standard deviation s = 0.224,
            quartile deviation q = 0.109,
            beta coefficient β = 1.414.

The medium risk class contains 13 securities:

Agros (H), Drosed (F), Efekt (S), Indykpol (F), Jelfa (C), Mostostal Export
(Bu), Mostostal Zabrze (Bu), Optimus (E), Prochem (Bu), Stalexport (H),
Vistula (T), Wolczanka (T), Zywiec (F);

with means: standard deviation s = 0.149,
            quartile deviation q = 0.078,
            beta coefficient β = 1.867,
            Z-score Z = 4.128.

The low risk class contains 7 securities:

Debica (C), Irena (C), Kable BFK (E), Okocim (F), Novita (T), Polifarb
(C), Wedel (F);

with means: standard deviation s = 0.183,
            quartile deviation q = 0.072,
            beta coefficient β = 1.22,
            Z-score Z = 9.241.

Industry: Ba - Banking, Bu - Building, C - Chemicals, E - Electric, F - Food,
Fu - Furniture, S - Trading and Servicing, T - Textile and H - Holding.

A similar problem of clustering stocks listed on the Warsaw Stock Exchange
has been examined in Kowalewski and Kuziak (1997), where the
classification did not include fundamental values. As has been shown above,
fundamental values in the form of the Z-score can have a significant effect
on the classification results. EsPeBePe is an interesting example here. Many
investors expected the bankruptcy of EsPeBePe. Including the Z-score caused a
movement of EsPeBePe from the medium risk class to the high risk class.
For the presented classifications empirical statistics were calculated.
Critical values for the statistics at significance level 0.05 are: 7.815 for
3 degrees of freedom (first classification) and 9.488 for 4 degrees of
freedom (second classification).
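
The critical values quoted above are the 0.95 quantiles of the central chi-square distribution with 3 and 4 degrees of freedom. They can be checked with the standard library alone, as in the following sketch; the series for the incomplete gamma function and the bisection search are implementation choices of this example, and the helper `share_within` is a hypothetical name for the share of stocks whose statistic stays below the critical value.

```python
import math


def chi2_cdf(x, df):
    """Central chi-square CDF via the series for the regularized
    lower incomplete gamma function P(df/2, x/2)."""
    if x <= 0:
        return 0.0
    a, y = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while abs(term) > 1e-15 * abs(total):
        term *= y / (a + n)
        total += term
        n += 1
    return total * math.exp(-y + a * math.log(y) - math.lgamma(a))


def chi2_critical(df, alpha=0.05):
    """Value exceeded with probability alpha, found by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, df) < 1.0 - alpha:
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, df) < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def share_within(statistics_, critical):
    """Share of observations whose statistic does not exceed the critical value."""
    return sum(1 for s in statistics_ if s <= critical) / len(statistics_)
```

A stock whose empirical statistic exceeds the critical value lies unusually far from the centre of its class and is treated as not clearly belonging to it.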

Comparison of the empirical statistics with the critical values gave the
following results:

FIRST CLASSIFICATION

Stocks with a value of the empirical statistic higher than 7.815:

High risk class: Domplast (8.32), Sokolow (8.37)
Medium risk class: EsPeBePe (8.24), Remak (9.49)
Low risk class: PP-AB (11.26)

SECOND CLASSIFICATION

High risk class: only Remak (10.74) and Sokolow (10.98) have a value of the
empirical statistic higher than 9.488
Medium risk class: all stocks are below 9.488
Low risk class: all stocks are below 9.488

In the first classification 88.4% of the examined stocks are in the right
risk classes. The results are in accordance with expectations, because the
five remaining securities are mostly speculative. In the second
classification, where the fundamental variable is added, the results are
better: 94.4% of the examined stocks belong to the right risk classes. This
paper examined the statistical significance of the classification of stocks
listed on the Warsaw Stock Exchange. The k-centroids method was used to
classify stocks in the risk context. Including a fundamental variable in the
measurement of investment risk gave very reasonable results. The results of
this verification confirmed the significance of this method.



Abstract: Distributional assumptions of financial return data are an important
issue for asset-pricing and portfolio management as well as risk controlling. In
order to capture the departure of empirical observations of financial return data
from normality the Student’s t-distribution has been proposed as an alternative
fat-tailed distribution in the literature. In this paper we (i) briefly summarize
the Student’s t-distribution; (ii) compare the tail behavior of the Student’s
t-distribution with empirical data; and (iii) discuss some implications of the
empirical results on the risk management based on Value-at-Risk. We also suggest
a simple statistic as a measure of tail-thickness based on the sample quantile
and the first absolute moment.



1 Introduction 

The underlying distributional assumption for financial return data is
important for portfolio management and risk controlling. Since the
pioneering work of Bachelier (1900) the Gaussian distribution has been a
usual assumption. It has been observed, however, that financial return data
are typically fat-tailed and excessively peaked around zero. Since the
distributions of many observed financial return data have tails that are
fatter than those implied by a conditional normal distribution, the Gaussian
assumption has severe consequences for empirical applications, especially
risk controlling. The Black-Scholes option pricing formula, for example,
used by Merton (1973) and others to evaluate deposit insurance, arbitrarily
assumes a normal distribution and, therefore, greatly understates the value
of deposit insurance if in fact the distribution is fat-tailed. Furthermore,
risk managers may underestimate the risk of their positions if they assume
normality for financial return data.

To capture the departure of financial return data from Gaussianity, various
alternative probability laws have been developed.¹ Blattberg and
Gonedes (1974) compare the t-distribution with the stable non-Gaussian
distribution and conclude that the t-distribution describes empirical data
better than the stable non-Gaussian distribution. They point out the
non-stability property of empirical data with respect to time aggregation,
i.e., return data defined over a long period may be described by the normal
distribution. Bollerslev (1987) also considers the t-distribution as the
underlying distribution for financial return data.

¹For example, Mandelbrot (1963) put forward the stable Paretian distributions;
DuMouchel (1973) studies contamination of a normal distribution by a symmetric stable
non-Gaussian distribution; and Hamilton and Susmel (1994) examine regime-switching
models in which the latent innovations come from normal and Student’s t-distributions.

In this paper we summarize the t-distribution as an alternative fat-tailed
distribution and compare the tail behavior of the t-distribution with some
empirical data. For the comparison, we propose a statistic which is the
ratio of a sample quantile of interest to the first absolute moment (FAM).
Finally, we discuss some implications of the empirical results for risk
management with Value-at-Risk (VaR) under the assumption of the
t-distribution.

The rest of the paper is organized as follows. Section 2 gives a brief
overview of the distributional properties of the t-distribution and proposes
a new statistic. Applying the new statistic, Section 3 compares the
tail-behavior of the t-distribution with some empirical data, and discusses
some empirical implications of the assumption of the t-distribution for risk
measurement based on VaR. Section 4 contains concluding remarks.



2 Unconditional Distribution of Return Data 
and Student’s t-Distribution

The increased importance played by risk in modern financial theory has ne- 
cessitated the development of the autoregressive conditional heteroskedastic 
(ARCH) model introduced by Engle (1982), which distinguishes conditional 
from unconditional second order moments. While the unconditional second 
order moments are often assumed time invariant for the analysis of financial 
data, the conditional second order moment may be modeled as time vari- 
ant depending on the past states of the world. Understanding the shape 
of the underlying distribution of the return data of interest is crucially im- 
portant for risk measuring and its controlling, because the specification of 
the underlying distribution is a bridge to a quantification of risk in portfolio 
management. 

Letting $P_t$ denote the price of an asset at time t and assuming no
dividends, the return over the period (t-1, t] is typically modeled as
$r_t = \log(P_t/P_{t-1})$. The return process, $r_t$, is observable and
defined to follow an ARCH model, where the conditional mean is
$E_{t-1}[r_t] = 0$, and the conditional variance,
$\sigma_t^2 = E_{t-1}[r_t^2]$, depends non-trivially on the past
observations. In practice, return data are typically standardized as
$r_t/\sigma_t$, which have conditional mean zero and a time invariant
conditional variance of unity. If the conditional distribution of the
standardized returns is normal, the unconditional distribution for $r_t$ is
leptokurtic:

$$\frac{E[r_t^4]}{(E[\sigma_t^2])^2} \geq \frac{E_{t-1}[r_t^4]}{(E_{t-1}[\sigma_t^2])^2} = 3$$

by Jensen’s inequality,² where equality holds for a constant conditional
variance only (see Bollerslev et al., 1994). That is, if financial return
data follow an ARCH model, the empirical density of the data is leptokurtic
and fat-tailed. Therefore, in order to analyze such data one may need a
distributional assumption which can describe the leptokurtic and fat-tailed
density of $r_t$. The family of mixed normal distributions is an interesting
candidate for the unconditional distributional assumption. The leptokurtic
and fat-tailed density can be derived as a continuous mixture of normal
densities. In this case the variance of the normal distribution is a random
variable (r.v.) whose density acts as a weight function in mixing the
conditional normals into the unconditional density. When the reciprocal of
the variance follows a Gamma-distribution, the unconditional distribution of
returns is a t-distribution. The t-distribution is then given by the
distribution of the ratio of a standard normally distributed r.v. to the
square root of an independently distributed chi-square r.v. divided by its
degrees of freedom (d.f.).



Definition 1 (Student’s t-distribution)

If X is a r.v. having density

$$f(x) = \frac{\Gamma[(k+1)/2]}{\Gamma(k/2)} \, \frac{1}{\sqrt{k\pi}} \, \frac{1}{(1 + x^2/k)^{(k+1)/2}},$$

then X is defined to have a t-distribution and its density is called a
t-distribution with k d.f.



Note that the mean and the variance of the t-distribution are 0 for k > 1
and k/(k - 2) for k > 2, respectively. For k ≤ 2, the t-distribution has
infinite variance, and for k = 1 it reduces to a Cauchy distribution with
no finite mean. If the number of d.f. approaches infinity, the t-distribution
converges to the standard normal distribution. For the purpose of practical
comparison of the t-distribution with standardized empirical data, it is
useful to standardize the density for k > 2:

$$f(x) = \frac{\Gamma[(k+1)/2]}{\Gamma(k/2)} \, \frac{1}{\sqrt{(k-2)\pi}} \, \frac{1}{[1 + x^2/(k-2)]^{(k+1)/2}}.$$

The density function of the resulting standardized variables has the desired
properties, namely it shows leptokurtosis and fat tails compared with the
standard normal distribution.
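
The leptokurtosis and the fat tails of the unit-variance t-density can be checked numerically. The sketch below evaluates the standardized density for k > 2 and compares it with the standard normal density; the function names are illustrative, and the log-gamma form is merely a numerically convenient way of writing the same formula.

```python
import math


def standardized_t_pdf(x, k):
    """Density of a t-distributed variable rescaled to unit variance (k > 2)."""
    if k <= 2:
        raise ValueError("unit-variance standardization requires k > 2")
    log_const = (math.lgamma((k + 1) / 2) - math.lgamma(k / 2)
                 - 0.5 * math.log((k - 2) * math.pi))
    return math.exp(log_const - (k + 1) / 2 * math.log1p(x * x / (k - 2)))


def standard_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)
```

For k = 4 the standardized t-density lies above the normal density both at the centre (about 0.53 against about 0.40 at x = 0) and far out in the tails (for example at x = 4), and below it in between.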

Next, we use a sample quantile estimator and the FAM to define a simple
statistic which compares the underlying theoretical distribution with
empirical data. Let $X_p$ be the p-th population quantile and $E[|X|]$ the
FAM; then the ratio

$$\xi_p = \frac{X_p}{E[|X|]} \qquad (2)$$

measures tail-thickness at the p-quantile, where typically p = 0.99 for the
measurement of risk in the context of VaR. To estimate the statistic in
(2), one uses a sample quantile, $\hat{X}_p$, suitably corrected for
continuity, and $n^{-1}\sum_{i=1}^{n} |x_i|$ as consistent estimators for
$X_p$ and $E[|X|]$, respectively. When d.f. converges to ∞, i.e., for the
standard normal distribution, we then have
$\xi_{.99} = 2.3263/\sqrt{2/\pi} = 2.9156$. The statistic $\xi_{.99}$
becomes bigger as d.f. decreases. The advantages of the statistic in (2) are
as follows: it is extremely simple to calculate; it provides an estimator
for d.f., which we need for standardization of t-distributed data; it exists
for all empirically relevant probability laws, i.e., t-distributions with
d.f. > 1; and it is invariant to standardization, given symmetry about zero
of the underlying distribution, which is an indispensable property for
empirical work. A useful modification of $\xi_p$ is given by

$$\xi_p^- = \frac{X_p^-}{E[|X^-|]},$$

where $X^-$ is the r.v. with a symmetric density based only on non-positive
returns.³ The statistic $\xi_p^-$ is designed to work appropriately for
asymmetric financial return data, if risk managers are more interested in
negative shocks, whose shape (both in the magnitude and the frequency of
outliers) often differs from that of positive shocks.

Table 1 gives the quantiles of the standardized t-distributions with unit
variance at 99%, $\tau_{.99}$, the FAM and $\xi_{.99}$ for some selected
d.f. Note that if one calculates the quantiles and the FAM from a rescaled
(usual) t-distribution, both of them will increase proportionally as d.f.
decreases, so that the corresponding value of $\xi_{.99}$ remains unchanged.

²$E[r_t^4] = E[E_{t-1}[r_t^4]] = E[3\sigma_t^4] = 3E[\sigma_t^4] \geq 3(E[\sigma_t^2])^2$.
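
The tail-thickness statistic is simple to estimate from a sample. The following sketch divides a sample quantile by the mean absolute value; the linear interpolation between order statistics is only one possible continuity correction and is an assumption here, as are the function names. The second helper mirrors the non-positive returns about zero to obtain the symmetric sample on which the modified statistic for negative shocks is based.

```python
import math


def tail_thickness(sample, p=0.99):
    """Sample p-quantile divided by the first absolute moment (FAM)."""
    xs = sorted(sample)
    n = len(xs)
    position = p * (n - 1)
    lower = math.floor(position)
    upper = min(lower + 1, n - 1)
    quantile = xs[lower] + (position - lower) * (xs[upper] - xs[lower])
    fam = sum(abs(x) for x in xs) / n
    return quantile / fam


def mirrored_negative_sample(sample):
    """Symmetric sample built only from the non-positive observations."""
    negatives = [x for x in sample if x <= 0]
    return negatives + [-x for x in negatives]


def normal_benchmark(z_p=2.3263):
    """Value of the statistic for the standard normal distribution."""
    return z_p / math.sqrt(2 / math.pi)
```

Because numerator and denominator scale together, multiplying all returns by a constant leaves the statistic unchanged, which is why no prior standardization of the data is needed.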



3 Empirical Application: 

Measuring Risk Based on Value-at-Risk 

In recent years, measuring risk based on VaR has been intensively
investigated in the field of risk controlling (see Jorion (1997) and Duffie
and Pan (1997) for introductory literature on VaR). VaR can be defined as
the worst possible loss over a given time interval at a given confidence
level. One can formulate VaR in the simplest case as the product of the
quantile of interest of the hypothesized (standardized) normal distribution
for the data, $z_{1-\alpha}$, and the standard deviation of the
corresponding empirical data, $\sigma$,

$$VaR = z_{1-\alpha}\,\sigma, \qquad (3)$$

where one uses the sample standard deviation as an estimator for $\sigma$.
In (3), $z_{1-\alpha}$ depends on the distributional assumption for
financial return data and, hence, the distributional assumption is a bridge
to a quantification of the risk. Typically, $z_{1-\alpha} = 2.3263$ under
the assumption of normality. However, the normality assumption is not an
appropriate approximation for the unconditional distribution of empirical
data, as we discussed in the previous section, if return data follow an
ARCH-process.

³The distribution function of $X^-$ is
$$P(X^- \leq x) = \begin{cases} \dfrac{P(X \leq x)}{2P(X \leq 0)}, & \text{for } x \leq 0, \\ 1 - P(X^- \leq -x), & \text{for } x > 0. \end{cases}$$

Table 1
Quantiles of the standardized t-distributions at 99%, FAM and $\xi_{.99}$

  k     $\tau_{.99}$   FAM    $\xi_{.99}$
  ∞     2.33           0.80   2.91
  10    2.47           0.80   3.11
  9     2.49           0.79   3.15
  8     2.51           0.78   3.20
  7     2.53           0.78   3.27
  6     2.57           0.76   3.36
  5     2.61           0.74   3.52
  4     2.65           0.70   3.78
  3     2.62           0.59   4.43

As an empirical application of the quantile estimation for VaR under the
assumption of t-distribution, we take the daily German Stock Market Index
(DAX), the daily Standard & Poor's 500 (S&P500), the daily exchange rates
of Deutschmark (DM) vs. US dollar ($) and the daily exchange rates of
Japanese Yen (JY) vs. US$.^4 Table 2 shows the corresponding quantiles
obtained by using $\xi_{.99}$ and $\xi^-_{.99}$ under the assumption of
t-distribution. The corresponding quantiles, $r_p$, result from a linear
interpolation, namely

$$r_p = r_{\min} + (\xi_p - \xi_{p,\min}) \times \frac{r_{\max} - r_{\min}}{\xi_{p,\max} - \xi_{p,\min}},$$

where $\xi_{p,\min}$ and $\xi_{p,\max}$ are values from Table 1 with
$\xi_{p,\min} < \xi_p < \xi_{p,\max}$, and $r_{\min}$ and $r_{\max}$ are the
corresponding quantiles.
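The interpolation step can be illustrated with a short Python sketch. The value pairs are copied from Table 1; the function name `interpolate_quantile` and the reading of the two table columns as the estimated tail measure and its standardized t-quantile are assumptions made for this illustration.

```python
from bisect import bisect_right

# (tail measure at 99%, standardized t-quantile at 99%) pairs from Table 1,
# ordered by increasing tail measure (k = infinity down to k = 3).
TABLE_1 = [
    (2.91, 2.33), (3.11, 2.47), (3.15, 2.49), (3.20, 2.51), (3.27, 2.53),
    (3.36, 2.57), (3.52, 2.61), (3.78, 2.65), (4.43, 2.62),
]


def interpolate_quantile(xi):
    """Linearly interpolate the quantile between the two neighbouring table rows."""
    xs = [x for x, _ in TABLE_1]
    if not xs[0] <= xi <= xs[-1]:
        raise ValueError("value lies outside the range covered by Table 1")
    hi = min(bisect_right(xs, xi), len(xs) - 1)  # index of the upper neighbour
    lo = hi - 1                                  # index of the lower neighbour
    (xi_min, r_min), (xi_max, r_max) = TABLE_1[lo], TABLE_1[hi]
    return r_min + (xi - xi_min) * (r_max - r_min) / (xi_max - xi_min)


r_example = interpolate_quantile(3.47)  # lies between the rows for k = 6 and k = 5
```

Because the table entries are themselves rounded to two decimals, interpolated values can differ from published figures in the last digit.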

All corresponding quantiles for the investigated data under the assumption of
the t-distributions are bigger than 2.3263 and, hence, risk managers would
underestimate their risk if they used the quantile from the standard normal
distribution. The estimated quantiles based on $\xi^-_p$ are mostly bigger than
those based on $\xi_p$, which means that negative shocks have fatter tails
than positive ones and, therefore, $\xi^-_p$ may give better information for
risk controlling. The differences between quantiles of the standardized
normal distribution and of the standardized t-distribution will be bigger for
the more extreme quantiles. We obtain, for example, $r_{.999} = 5.07$ for $k = 4$,



^4 We use 7,422 observations (July 02, 1962 - Dec. 30, 1991) for S&P500, 8,923 (Jan. 04,
1960 - Sep. 09, 1995) for DAX, 5,401 (Jan. 08, 1973 - July 28, 1994) for DM/US$, and
5,401 (Jan. 08, 1973 - July 28, 1994) for JY/US$ exchange rates.
Table 2

Estimated values of $\xi_{.99}$ and $\xi^-_{.99}$ and corresponding r-quantiles

Data       xi_.99    r_.99    xi^-_.99    r^-_.99
DAX        3.39      2.57     3.47        2.60
S&P500     3.79      2.65     3.58        2.62
DM/US$     3.53      2.61     3.60        2.62
JY/US$     3.61      2.62     3.80        2.65


whereas $z_{.999} = 3.09$. This demonstrates that one needs a more extreme
quantile (than the usual 99%) in the context of VaR to fully capture the risk
of extreme returns due to the fat tails of financial data.



4 Concluding Remarks 

The calculation of VaR according to the capital accord to incorporate market
risks proposed by the Basle Committee on Banking Supervision (1996) is based
on the normality assumption, but it includes the so-called insurance factor of
three, namely 3 x VaR. Interestingly, the insurance factor works well so long
as the data of interest have finite variance under the assumption of t-distribution.
If d.f. < 2, i.e., in the case of no finite variance, one cannot use the sample
standard deviation as an estimator for $\sigma$ in (3). Huschens and Kim (1998)
consider the measurement of risk based on VaR in the presence of infinite
variance.
Abstract: There are various statistical techniques to estimate the market risk of
a portfolio by identifying market risk factors and modeling the impact of factor 
changes on portfolio value. This paper shows how Value-at-Risk (VaR) estimates 
for market risk are obtained using artificial neural networks (ANN). 



1 Introduction 

The term “market risk” refers to financial losses due to movements in market 
prices of securities. Measures of market risk can be used to control and 
manage risk bearing business activities and to allocate financial resources. 
On the other hand market risk measures are needed to comply with the 
requirements of national regulatory authorities. 

The most general approach to describe potential losses from a portfolio would 
be to estimate the complete probability density function of changes in port- 
folio value over a given horizon. This function reflects knowledge about the 
market risk of the portfolio. In most cases such detailed information is not 
needed. To guarantee e.g. the stability and liquidity of a bank only losses 
beyond some critical value are important. For this reason market risk mod- 
els reduce the information content of a distribution or density function to 
a single statistical measure. In the case of VaR this measure is a quantile 
determined by a one-sided prediction interval of future portfolio losses 



$$\mathrm{Prob}(L_t < \mathrm{VaR}) = p \tag{1}$$

$$L_t = W_{t-\Delta t} - W_t \tag{2}$$

where $L_t$ denotes the loss over the investment horizon $\Delta t$ and $W_t$ the market
value of a portfolio at a given time; $p$ refers to a given probability. If we
specify an investment horizon of $\Delta t$ = 1 day and a probability of, say, 95%, a
VaR measure of $100,000 simply says that the probability of losing more
than $100,000 over the next day is less than 5%.

In most practical cases it is impossible to determine the distribution of
portfolio losses directly, since past changes in portfolio value reflect
changes in market prices as well as the past portfolio structure, which usually
differs from the current portfolio composition.

It is therefore necessary to identify market risk factors $R = (R_1, R_2, \ldots, R_k)'$
and convenient to model portfolio losses as a function of the relative changes
$X = (X_1, X_2, \ldots, X_k)'$ of those market risk factors. If we consider an
investment horizon of one day, we can rewrite equation (1) for a given time $t$ as



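$$L_t = W(R_{t-1}) - W(R_t) \tag{3}$$

$$X_{t,i} = \frac{R_{t,i} - R_{t-1,i}}{R_{t-1,i}}, \quad i = 1, 2, \ldots, k \tag{4}$$

$$\mathrm{Prob}\big(L(X_t) < \mathrm{VaR} \mid r_{t-1}, x_{t-1}, x_{t-2}, \ldots\big) = p \tag{5}$$

where $r_{t-1}, x_{t-1}, x_{t-2}, \ldots$ denote the history of observed market risk factors
or their returns, respectively.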

Equation (5) shows the random variable $L_t$ to be a function of a vector
of random variables $X_t$. Even if the distribution of $X_t$ is known, finding an
analytic solution for (5) may turn out to be impossible if $L(X_t)$ is a nonlinear
function. Thus the key to VaR estimation is to determine the distribution
of portfolio losses given the distribution of market risk returns.^1



2 VaR Models 

Most VaR models assume risk factor returns to be generated by a pre-specified
distribution. In that case a major part of VaR estimation consists
of estimating the parameters of that distribution using observed data. A
common choice is the (multivariate) Gaussian distribution.^2

The main reason for using a Gaussian distribution is that, for linear
portfolios, this assumption leads to a particularly simple solution for the
distribution of portfolio losses.^3 If we assume that the conditional
distribution of market risk returns at time $t$ is Gaussian,

$$X_t \sim N(\mu_t, \Sigma_t), \tag{6}$$

then we can write for the distribution of losses^4 for a linear portfolio $P$
specified by $w = (w_1, w_2, \ldots, w_k)'$^5

$$L_t = w'X_t \sim N(\mu_{p,t}, \sigma^2_{p,t}) \tag{7}$$

with $\mu_{p,t} = w'\mu_t$ and $\sigma^2_{p,t} = w'\Sigma_t w$.

^1 VaR models differ in their approach to model the distribution of X and to "translate"
this distribution to the distribution of portfolio losses. We will discuss some models in
section 2.

^2 See for example Jorion (1997) p. 149ff.

^3 A linear portfolio is one where changes in portfolio value depend linearly on changes
in market risk returns. A typical example is a foreign exchange portfolio where exchange
rates constitute the market risk factors. The problem gets more complicated if currency
options are involved.

^4 Note that equations (6) and (7) contain a time index t indicating that the parameters
of the distribution of market risk returns and portfolio losses may vary over time if the
data generating process is nonstationary. Throughout this paper we estimate VaR on the
basis of predicted conditional distributions.






With the normality assumption the calculation of VaR for linear portfolios
is quite simple. It is calculated as a quantile of a Gaussian distribution as
follows:

$$\mathrm{VaR} = \mu_{p,t} + \Phi^{-1}(p)\,\sigma_{p,t} \tag{8}$$

where $\Phi^{-1}(p)$ denotes the p-quantile derived from the standardized Gaussian
distribution.

If we assume market risk returns to be independently and identically distributed
over time, we can write $\mu_{p,t} = \mu_p$, $\sigma_{p,t} = \sigma_p$ and easily
extend equation (8) to an arbitrary investment horizon $T$:

$$\mathrm{VaR} = \mu_p T + \Phi^{-1}(p)\,\sigma_p \sqrt{T} \tag{9}$$

Equations (8) and (9) are frequently called the "delta-normal" approach to
VaR.
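The delta-normal calculation can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that the portfolio mean and variance follow from the Gaussian risk factor distribution as $w'\mu$ and $w'\Sigma w$, and the weights, means and covariance matrix below are hypothetical numbers.

```python
from math import sqrt
from statistics import NormalDist


def delta_normal_var(w, mu, cov, p=0.99, horizon=1):
    """Delta-normal VaR of a linear portfolio over `horizon` periods."""
    k = len(w)
    mu_p = sum(w[i] * mu[i] for i in range(k))                               # w' mu
    var_p = sum(w[i] * cov[i][j] * w[j] for i in range(k) for j in range(k))  # w' Sigma w
    z = NormalDist().inv_cdf(p)                                               # Gaussian p-quantile
    return mu_p * horizon + z * sqrt(var_p) * sqrt(horizon)


# Hypothetical two-factor portfolio (illustrative values only).
w = [100.0, 50.0]
mu = [0.0, 0.0]
cov = [[0.0001, 0.00002],
       [0.00002, 0.0004]]

var_1d = delta_normal_var(w, mu, cov)
var_10d = delta_normal_var(w, mu, cov, horizon=10)
```

With zero means the 10-day figure is simply the 1-day figure scaled by the square root of ten, which is the content of the horizon extension.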

Since empirical evidence does not support the assumption of independence
and normality in market risk returns, it is reasonable to consider more
sophisticated models. A first step is to model time varying parameters of
the distribution of risk factor changes without giving up the normality
assumption. In J.P. Morgan's RiskMetrics this is done by calculating the
covariance matrix using exponential moving averages.^6 Another approach is
to fit ARCH/GARCH models to the series of market returns to capture the
variation in the conditional variance of risk factor returns.^7
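An exponentially weighted moving average of squared returns can be sketched as below. The recursion and the decay factor of 0.94 are assumptions chosen for illustration; they are not specified in this paper, and the function name `ewma_variance` is hypothetical.

```python
def ewma_variance(returns, lam=0.94, initial_variance=None):
    """Exponentially weighted variance estimates, updated after each return.

    lam is the decay factor: values close to 1 give a long memory.
    """
    if not 0.0 <= lam < 1.0:
        raise ValueError("decay factor must lie in [0, 1)")
    # Start from the first squared return unless a starting value is supplied.
    variance = returns[0] ** 2 if initial_variance is None else initial_variance
    estimates = []
    for r in returns:
        variance = lam * variance + (1.0 - lam) * r * r
        estimates.append(variance)
    return estimates


# Hypothetical return series (illustrative values only).
estimates = ewma_variance([0.01, -0.02, 0.015, -0.005])
```

The same recursion applied to cross products of two return series would give a covariance estimate, so a full covariance matrix can be built element by element.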

Other approaches use Historical Simulation or Monte Carlo Simulation.
Both offer completely different approaches to estimating VaR.^8 In practice,
data problems or the computational burden involved prohibit the application
of these techniques to portfolios with large numbers of risk factors.



3 VaR-Estimation with Artificial Neural Net- 
works 

Section 2 showed that estimating the distribution of market risk returns is a
crucial part of estimating VaR. Artificial Neural Networks (ANN) have been
widely applied to various problems of (conditional) density estimation. Some
ANN models assume some specific (parametric) form for the distribution
under consideration. Others resemble non-parametric techniques for density
estimation.^9 One suitable approach for modeling conditional densities is
Mixture Density Networks.^10 Mixture densities define a very broad and
flexible class of distributions that are capable of modeling completely general
distributions.

^5 Each element $w_i$ represents the negative sensitivity of the portfolio value to a relative
change in risk factor i.

^6 See J.P. Morgan/Reuters (1996) p. 77ff.

^7 See Hsieh (1993) and J.P. Morgan/Reuters (1996) p. 88f.

^8 See for example Butler and Schachter (1996) and Holton (1998).

The main idea of a Mixture Density Network (MDN) is to superimpose simple
component densities with well-known properties to generate or approximate
more complex distributions. In the model considered here the outputs
of the neural network determine the parameters of the mixture model.
Equations (10) and (11) define a conditional Mixture Density Network with
Gaussian component densities^11 $h_i$, with the parameters of the mixture model
being general (continuous) functions of the network inputs $z$:

$$f_{X|Z=z}(x) = \sum_{i=1}^{m} \alpha_i(z)\, h_i(x|z) \tag{10}$$

$$h_i(x|z) = (2\pi)^{-k/2}\, \sigma_i^{-k}(z) \exp\left\{ -\frac{\lVert x - \mu_i(z) \rVert^2}{2\sigma_i^2(z)} \right\} \tag{11}$$

In order to guarantee that (10) is a probability density function, the mixing
coefficients have to satisfy the constraint

$$\sum_{i=1}^{m} \alpha_i = 1, \quad \alpha_i > 0, \quad i = 1, 2, \ldots, m.$$

To estimate the parameters of the mixture density network we use the
data sets $D_x = (x_1, x_2, \ldots, x_T)$ with $x_t = (x_{1,t}, x_{2,t}, \ldots, x_{k,t})'$
and $D_z = (z_1, z_2, \ldots, z_T)$ with $z_t = (z_{1,t}, z_{2,t}, \ldots, z_{l,t})'$^12
and define the error function $E$ to be the negative logarithm of the
likelihood function^13

$$\mathcal{L} = \prod_{t=1}^{T} f_{X,Z}(x_t, z_t) = \prod_{t=1}^{T} f_{X|Z=z_t}(x_t)\, f_Z(z_t) \tag{12}$$

^9 For an overview see Neuneier et al. (1994). See also Weigend and Srivastava (1995),
Williams (1996) and Husmeier and Taylor (1997).

^10 See Bishop (1994, 1995). Networks of this type are frequently called Gaussian Mixture
Networks (GMN). The GMN mentioned in Neuneier et al. (1994), however, differs slightly
from the model presented here.

^11 Note that (11) uses a single scaling parameter $\sigma$ rather than a full covariance matrix.
This does not inhibit the modeling of dependencies in $X_t$ though.

^12 A common approach in time series prediction is to embed a series in lag space. To
predict $f_{X|Z=z_t}(x)$ at time $t$ we might use $z_t = (\ldots, z_{l,t-n})'$. Alternatively
consider the special case where a single risk factor return X is assumed to follow an
ARCH(p) process. An appropriate MDN would consist of a single linear hidden neuron
and p input neurons. The input vector Z would contain past squared values of X.

^13 Equation (12) assumes the training data to be drawn independently from the mixture
distribution. Minimizing (13) is equivalent to maximizing (12). The term $f_Z(z_t)$ will be
excluded since it does not depend on the parameters of the mixture model.
$$E = -\ln \mathcal{L} = \sum_{t=1}^{T} E_t \tag{13}$$

$$E_t = -\ln\left\{ \sum_{i=1}^{m} \alpha_i(z_t)\, h_i(x_t|z_t) \right\} \tag{14}$$
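The mixture density and the error function can be illustrated with a small Python sketch. It assumes isotropic Gaussian components with one scale parameter each, and it takes the mixture parameters for every observation as given; in a Mixture Density Network they would be produced by the network from the input $z_t$. All function names are hypothetical.

```python
from math import exp, log, pi


def component_density(x, mu, sigma):
    """Isotropic Gaussian component density with a single scale parameter."""
    k = len(x)
    squared_distance = sum((xj - mj) ** 2 for xj, mj in zip(x, mu))
    return (2 * pi) ** (-k / 2) * sigma ** (-k) * exp(-squared_distance / (2 * sigma ** 2))


def mixture_density(x, alphas, mus, sigmas):
    """Weighted sum of the component densities."""
    return sum(a * component_density(x, m, s) for a, m, s in zip(alphas, mus, sigmas))


def negative_log_likelihood(xs, params):
    """Sum over observations of minus the log mixture density.

    params[t] = (alphas, mus, sigmas) holds the mixture parameters for observation t.
    """
    return -sum(log(mixture_density(x, *p)) for x, p in zip(xs, params))


# Hypothetical one-dimensional example with two components (illustrative values).
params_t = ([0.7, 0.3], [[0.0], [0.0]], [0.01, 0.03])
error = negative_log_likelihood([[0.005], [-0.02]], [params_t, params_t])
```

Training would adjust the network weights so that this error decreases, which is where the gradient information discussed next comes in.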

To minimize (13) we apply a standard back-propagation algorithm, which
requires calculating the partial derivatives of the error function with
respect to the parameters of the mixture model. Using the chain rule, this
gradient information can be used to adjust the network weights accordingly.
It is known that simple gradient descent based methods are very slow for
two layer networks. Ormoneit and Tresp (1995), Jordan and Xu (1993)
and Husmeier and Taylor (1997) explored the application of the Expectation
Maximization (EM) algorithm to MDN and found a significant speed-up
of the learning process. Applying the EM algorithm to MDN requires at
least some parameters of the mixture model not to depend on the network
inputs and network weights, respectively. Thus the EM scheme cannot be
applied directly to the mixture model discussed in this paper.^14 As
an alternative, fast gradient based techniques (e.g. second order methods)
should be taken into account.

Training a Mixture Density Network using a maximum likelihood param- 
eter estimate may easily lead to overfitting. To maximize the likelihood 
function we simply need to concentrate one Gaussian kernel density on a 
single data point and let the standard deviation of that kernel approach 
zero. This obviously leads to very poor predictions. In this paper we used 
very simple mixture models with few parameters to make the case of a triv- 
ial maximization of the likelihood less likely. First results (Locarek-Junge 
and Prinzler (1998a, 1998b)) support our approach. A more sophisticated 
approach to regularize MDNs can be found in Ormoneit and Tresp (1995). 



4 Calculating VaR 

In the following sections we assume that the unknown probability density
function of market risk returns can be approximated by a conditional mixture
distribution as specified in (10). The first step in estimating the VaR of a
financial position is to estimate the probability density function of the market
risk returns belonging to that position. In our case this is done by training
the MDN over a given training data set $D_{train} = (x_1, z_1, x_2, z_2, \ldots, x_T, z_T)$.
After the training of the network is completed we need to calculate the
distribution of future portfolio losses $f_{L,t}(l)$. For a given prediction data
set $D_{test} = (x_{T+1}, z_{T+1}, x_{T+2}, z_{T+2}, \ldots, x_{T+N}, z_{T+N})$ we calculate for every
$t$ our estimate $\hat f_{X|Z=z_t}(x)$ of (10). Note that $x$ is not known at the time the
prediction is made and does not enter into the prediction step. It is used
($x = x_t$), however, in evaluating the prediction performance. $z_t$, on the other
hand, has to be known and denotes a vector containing relevant information
for predicting the distribution of $X$ at time $t$.

^14 This does not mean that it should not be taken into consideration at all. The Generalized
EM algorithm (GEM) may be a fast alternative to backpropagation and its variants.
We are going to explore this in further research.

Since the portfolio loss $L_t$ is a function of market risk returns (5), we
may calculate our estimate $\hat f_{L,t}(l)$ from $\hat f_{X|Z=z_t}(x)$ for every $t$.
In the case of mixture distributions this turns out to be difficult even for
linear portfolios, and all the more so if $L(X)$ is a non-linear function. A
general way to calculate the Value-at-Risk of a portfolio is to apply Monte
Carlo Simulation and to use a large number of hypothetical risk factor returns

$$\tilde x_{i,t} \sim \hat f_{X|Z=z_t}(x), \quad t = T+1, \ldots, T+N, \quad i = 1, \ldots, N_{MC}$$

to calculate a hypothetical distribution of portfolio losses $\hat F_{L,t}$. The
Value-at-Risk (see (5)) can be determined as follows:

$$\tilde l_{i,t} = L(\tilde x_{i,t}) \tag{15}$$

$$\hat F_{L,t}(l) = \frac{\#\{i : \tilde l_{i,t} \le l\}}{N_{MC}} \tag{16}$$

$$\mathrm{VaR}_t = \hat F_{L,t}^{-1}(p) \tag{17}$$
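The Monte Carlo step can be sketched in Python as follows. The sketch assumes a Gaussian mixture with isotropic components whose parameters are already known for the day in question, draws hypothetical risk factor returns from it, maps them to losses and reads off the empirical p-quantile. Function names, the seed and all numbers are hypothetical.

```python
import math
import random


def sample_mixture(rng, alphas, mus, sigmas):
    """Draw one vector of risk factor returns from an isotropic Gaussian mixture."""
    i = rng.choices(range(len(alphas)), weights=alphas)[0]  # pick a component
    return [rng.gauss(m, sigmas[i]) for m in mus[i]]


def monte_carlo_var(loss, alphas, mus, sigmas, p=0.99, n_mc=10000, seed=0):
    """Empirical p-quantile of simulated portfolio losses."""
    rng = random.Random(seed)
    losses = sorted(loss(sample_mixture(rng, alphas, mus, sigmas)) for _ in range(n_mc))
    index = min(n_mc - 1, max(0, math.ceil(p * n_mc) - 1))
    return losses[index]


# Hypothetical linear portfolio in two risk factors (illustrative values only).
w = [100.0, 50.0]


def linear_loss(x):
    return sum(wi * xi for wi, xi in zip(w, x))


var_99 = monte_carlo_var(
    linear_loss,
    alphas=[0.9, 0.1],
    mus=[[0.0, 0.0], [0.0, 0.0]],
    sigmas=[0.01, 0.03],
)
```

The `loss` argument may be any function of the simulated returns, so the same routine covers non-linear positions at the cost of one portfolio revaluation per draw.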



We will not present results for VaR estimations here. Some work has been 
done on exchange rates and currency portfolios so far. Results for portfo- 
lios containing US-Dollar, Swiss Franc and Japanese Yen were reported in 
Locarek-Junge and Prinzler (1998a) and for a single US-Dollar portfolio in 
Locarek-Junge and Prinzler (1998b). 



5 Summary 

The results of our study show that Mixture Density Networks may offer 
an alternative to VaR models that rely on the assumption of normality in 
market risk returns. The methods outlined here can be applied to arbitrary 
portfolios though. In this paper we considered very simple Mixture Density 
Networks. More complex networks which may be required for multifactor 
portfolios need to be regularized and will be studied in further research. 
Abstract: This paper takes a new look at the old theme of forecasting daily
USD/DEM changes. Fundamental data with daily availability are used to build
quantitative models. The purpose of this paper is twofold: The first contribution
of the paper is to analyse the influence of discretization on financial data. Second,
it examines the capability of a neural network for forecasting daily exchange rate
movements and compares its predictive power with that of linear regression and
discriminant analysis in the case of discretized data. Thus the objective of this study
is to address the issues faced by users of quantitative forecasting systems in terms
of appropriate data transformations and model selection.



1 Motivation: Daily Fundamental Models for 
Currency Markets 

This study focuses on the relationship between the fundamental environment
and the daily change in USD/DEM. To date only little research has been
conducted in this field. The main reason for this is considerable
data limitations: it makes no sense to use monthly published macroeconomic
data to model daily currency movements, since exchange rate movements
are dramatically more volatile than the fluctuations in the macroeconomic
environment. To tackle these data limitations, the approach followed here to
forecast exchange rate movements is to build models based solely on fundamental
data given by the most important interest rate futures. The application of
interest rate variables seems to be reasonable for at least two reasons:

• The interaction between interest rate and exchange rate movements
is supported by the interest rate parity theorem.

• Besides the currency market, interest rate futures are the second most
liquid sector of financial markets.

Thus there is a tight relationship between these two market segments in
terms of theoretical background, arbitrage opportunities and liquidity. Based
on these considerations we chose the following interest rate futures of the
US and Germany covering monetary and bond markets:
1. US 3 months money market future: Daily price change

2. US 3 months money market future: Actual future price minus next 
future price, daily change 

3. German 3 months money market future: Daily price change 

4. German 3 months money market future: Actual future price minus 
next future price, Daily change 

5. Interest rate US 10 years: Daily yield change 

6. Interest rate Germany 10 years: Daily yield change 

7. Money market future: US 3 months - German spread 3 months, daily 
price change 

8. Interest rate spread: US 10 years - German 10 years, daily yield change 

Two main features regarding the interest rate spread changes are reflected 
by the chosen variables: 

• First, we focus on the issue of whether market expectations of the 
future money market term structure and thus expectations of further 
central bank policies have predictive power. Therefore three- and six- 
month money market rates of the US and Germany, given by the cor- 
responding money market futures changes (variables (2) and (4)) are 
considered. 

• Second, the interest rate spread change between these two countries
describes the relative attractiveness of fixed income investments and
is modelled with variables (7) and (8).

2 Discretization 

The output variable, the daily change of USD/DEM, was included in this
model both as a continuous variable and as a discretized one. The reason
for discretizing exchange rate changes lies in the empirical evidence that
the distribution of financial time series is largely leptokurtic (see for
example Affleck-Graves / McDonald (1989)). That means that the distribution of
these data is unimodal and approximately symmetric, with a higher
peak and fatter tails than the normal distribution. Therefore discretization
should serve as a method to avoid modelling the noisy small fluctuations
and to support modelling the valuable large fluctuations. In particular,
the fundamental interest rate based model should provide evidence on whether large
fluctuations of exchange rate movements are caused by previous important
interest rate changes. In principle, a discretization is simply a logical
condition that serves to partition the data into at least two subsets. The main
question is how to set the so-called cut-points to split the numerical
attributes. Four cut-points are necessary for a discretization into five classes
as used in this study. In addition to the empirical standard deviation of
0.45 % (yielding classes -2 and +2), half of the standard deviation,
0.23 %, was applied to separate two additional classes (-1 and +1). The
class unchanged is then given by fluctuations between -0.22 % and 0.22 %.
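The five-class discretization can be sketched as a simple threshold function. The cut-points at one and one half standard deviation follow the description above; the handling of values exactly on a cut-point and the function name `discretize` are assumptions of this sketch.

```python
def discretize(change_pct, sd=0.45):
    """Map a daily change (in percent) to one of the classes -2, -1, 0, +1, +2."""
    half = sd / 2.0          # half of the empirical standard deviation
    if change_pct <= -sd:
        return -2            # large negative move
    if change_pct <= -half:
        return -1            # moderate negative move
    if change_pct < half:
        return 0             # unchanged
    if change_pct < sd:
        return 1             # moderate positive move
    return 2                 # large positive move


# Hypothetical daily changes in percent (illustrative values only).
classes = [discretize(c) for c in (-0.60, -0.30, 0.10, 0.30, 0.50)]
```

A classifier trained on these labels no longer has to reproduce the size of small fluctuations, only the class they fall into.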



3 Forecasting Methodologies - Neural Net- 
works 

In addition to classical time-series forecasting methods (see Box-Jenkins
(1976) for more), neural networks are now widely used for financial forecasting.
Examples using neural networks in currency applications include Green
and Pearson (1994), Manger (1994), Rawani (1993), and Zhang (1994).
Feed-forward backpropagation networks are the most commonly used networks.
Overfitting is a well-studied phenomenon (Weigend (1993), Anders
(1995)) where a learning algorithm tunes the parameters of a model too
closely to the training set, thus reducing the overall generalization
performance of the model. Nearly all financial applications of nonparametric
models (such as neural networks) vary model complexity by either early stopping
or adjusting network size. Likewise, nearly all studies apply cross-validation
to select the best model. Performance on the validation set can be used to
set parameters such as learning rates or when to stop training. But this class
of methods suffers because the validation error can be a noisy approximation
of the true generalization error (MacKay (1992)). Another often used method for
improving generalization is regularization by attaching a penalty term to a
model's error term (Weigend (1993)). But this regularization method has a
considerable disadvantage, as the expansion of the model's error term generates
an additional bias (Anders (1995), p. 35). Given all these drawbacks, another
way to tackle overfitting is chosen in this study. It is based on the idea of
combining predictions, which goes back to early research by some statisticians
(e.g. Bates and Granger (1969)). The combination of different neural networks
provides a reliable tool to avoid overfitting (see Groot (1993) for more). To achieve
different network outputs $x_i$ and different weight sets, respectively, $k$
networks with the same topology were optimized with different random weights.
Then the forecasts $y_{t+1}$ were calculated with



$$y_{t+1} = \sum_{i=1}^{k} x_{i,t+1} \tag{1}$$

where $y_{t+1}$ represents a 1-period-ahead forecast based on $k$ individual
1-period-ahead forecasts $x_{i,t+1}$ ($i = 1, \ldots, k$). The number $k$ of
different networks was set to 5. Table 1 contains the values of the parameters of one
single network.
Parameter                   Value
Number of hidden neurons    3
Activation function         Hyperbolic tangent
Learning rate               0.02
Momentum                    0.90
Number of iterations        1500

Table 1: Parameters of one single network to forecast USD/DEM change

4 Results 

It is of interest to see whether (I) a quantitative method is able to
beat the naive prediction of the random walk, (II) a discretization of forecasts
can improve on the continuous forecasts and (III) the nonlinear forecasting
methodology of neural networks is able to capture possible nonlinearities and
so perform better than the corresponding linear counterpart. To this end five
models were tested:

• NAIV - Naive prediction 

• OLS - Multiple linear regression: Forecasting continuous USD/DEM 
change 

• DISKR5 - Discriminant analysis: Forecasting discrete 
USD/DEM change, fivefold classification 

• NNCON - Neural Network for forecasting continuous change of 
USD/DEM 

• NNDI5 - Neural Network for forecasting discrete change of USD/DEM, 
fivefold classification 

To check the forecasting ability of the neural networks, three benchmark
methods are applied: Besides the naive prediction in terms of a trend
continuation (i.e. the forecast for the next day is the actual change of the
exchange rate), multiple linear regression serves as a benchmark for
the forecast of the continuous change of USD/DEM. Further, linear discriminant
analysis is the benchmark for forecasting the discretized daily change.
This technique is used to classify observations, described by a set of
independent variables, into two or more mutually exclusive categories.

The estimation period covered 327 observations from December
4th, 1995 until February 28th, 1997. The generalization set consists of 130
changes from March 3rd until September 28th, 1997. Four performance
measures, namely accuracy rate, annualized return, Sharpe-Ratio and
reliability, are reported for this generalization period (see Zimmermann
(1994) for more).
Model     Accuracy %    Return % p.a.    Sharpe-Ratio    Reliability
NAIV      41.86         -23.56           -29.25          28.77
OLS       55.81         1.34             0.96            56.94
DISKR5    38.76         -17.67           -20.58          44.29
NNCON     41.09         -27.10           -27.43          35.59
NNDI5     55.81         2.71             2.39            52.46

Table 2: Backtest: Out-of-sample performance



The results of the forecasting comparison are reported in table 2 and deserve
further attention.

• Using multiple linear regression or neural networks trained on fivefold
classification, it was possible to outperform the naive prediction in
terms of higher accuracy and annualized return. This is an indication
that the application of fundamental data to forecast daily
changes of USD/DEM could have some systematic predictive power.

• Interesting conclusions can be drawn from the results of the neural
networks: The best performance is delivered by the neural network
trained on fivefold classification (NNDI5). The neural network trained
on continuous data achieved the worst annualized return of all models.
This indicates an overfitted architecture despite the averaging
applied to avoid this.

• Linear methods, i.e. discriminant analysis, clearly underperform the
nonlinear neural network procedure. The weakness of discriminant analysis
might be the high information loss caused by the discretization.
Only neural networks are able to generate added value when applied to
discrete data. Thus the discretization of USD/DEM into five classes
seems to be an interesting measure to exploit more information
from the data set used.

• Regarding risk aspects, i.e. the Sharpe-Ratio, NNDI5 provides better
outcomes than OLS.

Thus the fivefold discretization together with the application of neural
networks delivers valuable insights into Dollar/Mark exchange rate movements,
even for the trading desk. The model captures rather well the relationship
between important interest rate changes between the US and Germany
and the daily change of USD/DEM. To sum up, this paper gives empirical
evidence that data transformation and model selection have considerable
interdependencies. Users of nonlinear methodologies in particular have to
take this issue into account.
1 Introduction

Multidimensional statistical analysis uses techniques of variable selection
and aggregation. Numerous economic phenomena belong to the application
field of multivariate analysis. Generally, all phenomena described by a
large number of variables may be analysed within this framework. Such
phenomena require descriptive characteristics that make clear
understanding and unequivocal evaluation difficult. Bank evaluation is one
of those that undoubtedly demand multidimensional analysis. A bank, as an
institution of public trust, is evaluated both from outside (clients, stockholders,
investors) and from inside (bank management). This calls for a better
methodology: the transborder reference system might be regarded as one of the
interesting proposals.

The system employs the approach of MCDM (multiple criteria decision making)
techniques and refers to reference objectives (French: but de
référence, German: Referenzziel). Generally speaking, the transborder reference
system defines a certain reference system which can be considered as a
comparative basis for evaluation and gives the evaluators, the receivers, or
those who order the evaluation the opportunity to formulate definite
requirements.

The bank assessment process is carried out in three main steps:

Step I - clustering of the values of the measures reflecting the functioning
of banks (or their branches).

Step II - normalising the values of the measures.

Step III - construction of the aggregate measure (indicator).



2 Classification of the Variables (Indicators) 
Reflecting Banks (or their Branches) Func- 
tioning 

The performance indicators are classified into three types: Stimuli - where a
higher value means better performance; Destimuli - where low values
indicate better performance; and Nominants - where the best value (or best
value interval) is implied: if the measure takes the implied value or a value within
the implied interval, bank performance is positively assessed. Such a general
classification of measures needs further analysis (comp. [7]). In this paper a deeper
inquiry into the typology of bank performance measures is presented. The division
listed above does not cover all possible evaluations, so it requires further division
into a more detailed classification.



STIMULI, with values $x_{kj} \in R$, are classified into:

1. Stimuli $S_1$ without veto threshold value, with measure values $x_{kj} \in R_+$.

2. Stimuli $S_2$ with veto threshold value, with measure values $x_{kj} \in R$.

Examples of stimuli without veto threshold value are the balance
of the bank and the income volume of the bank.

As an example of second type stimuli one can mention the indicator of bank
solvency, with the minimum threshold established by the BIS (Bank for International
Settlements) at the level of 8%.



DESTIMULI, with values $x_{kj}$, are classified into:

1. Destimuli $D_1$ without veto threshold value.

2. Destimuli $D_2$ with veto threshold value.

Generally one can expect that indicators which have destimuli character do
have a veto threshold value.

The indicator of the so-called "bad credits" or "difficult credits" in the bank's
credit portfolio may serve as an example of a destimulus with veto threshold.
In established market economies it is assumed at the level of 5% (comp.
[3]). A further example is the ratio of "bad credits" to operating assets,
which in accordance with D. Blickenstaff [7 p. 35] should not be higher than
1%, i.e. the veto threshold equals 1%. The same author defines the veto threshold
for the rate of overheads in operating assets at a level not higher than 2%.

NOMINANTS , with values Xkj G are classified into: 

1. Nominants Ni - with recommended nominal value 

2. Nominants N 2 - with recommended interval of values which ends are 

defi ned by the veto threshold values x^^ and x^^ . 

3. Nominants - with recommended nominal value and acceptable value 
interval which ends are defined by the veto threshold values and . 



The current liquidity measure, defined as current assets to current
liabilities, can serve as an example of a nominant of the second type. The
value of the ratio should lie within a certain interval, e.g. [1.0 - 1.25].
A drop of the ratio below 1.0 signals that financial safety is endangered. Too
high a value of this measure indicates overliquidity of the bank. In our
notation we have $x_{0j}^{N_2^1} = 1.0$ and $x_{0j}^{N_2^2} = 1.2$.

An example of a nominant of the third type is the ratio of deposits to
credits, with the optimal value equal to one and allowable (acceptable) values
ranging from $x_{0j}^{N_3^1} = 0.8$ to $x_{0j}^{N_3^2} = 1.6$. The allowable
value interval is certainly open to discussion and should be settled
individually for each bank, depending on its credit portfolio, the volume of
the financial market, etc.

The classification presented here does not cover all possibilities and will
be supplemented in future.



3 Normalisation of the Variable Values 

An important part of the solution is a method for normalising the values of
the variables (indicators) that makes it possible to identify (distinguish)
banks $P_k$ ($k = 1, \ldots, K$) which do not meet the declared or recommended
threshold values $x_{0j}$.

For this purpose the transborder reference system has been introduced (see
[7]). The transborder reference system is defined as a set of limits or
recommendations that allows the identification of considerably worse banks,
i.e. banks which break certain established limits or recommended indicator
values. The evaluation can be carried out once the indicator values of the
individual banks have been normalised.

The main objective of normalising the variables is to make comparisons
admissible. At the same time normalisation strips all variables of their
units of measure and unifies the levels of their values. In our case,
normalisation of the variable values allows comparative analysis between
banks and between measures, and also enables synthetic analysis.
Normalisation of the identifying features is necessary when multidimensional
statistical analysis is applied as a tool for classifying objects and
establishing their hierarchy. The literature offers several normalisation
formulas (cf. [10, 5, 6, 7]). Our proposal is based on the classification of
measures introduced above and achieves the stated target, i.e. identification
of banks which do not fulfil the recommended values of the measures.

Normalisation of the features (measure values) should be done as follows:

STIMULI

1. The variable values which have stimuli character of type $S_1$, i.e.
without veto threshold value, with measure values $x_{kj} \in R_+$ (in our
notation $j \in S_1$), are normalised as follows:

$$z_{kj} = \frac{x_{kj}}{\max_k\{x_{kj}\}}; \quad x_{kj} \in R_+, \quad \text{for } j \in S_1 \qquad (1)$$

$$z_{kj} \in \left[\frac{\min_k\{x_{kj}\}}{\max_k\{x_{kj}\}};\; 1\right] \qquad (2)$$

where:

$x_{kj}$ - value of variable $j$ in bank $k$;

$z_{kj}$ - normalised value of variable $j$ in bank $k$.

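2. The variable values which have stimuli character of type $S_2$, i.e. with
veto threshold value $x_{0j}^{S_2}$, with measure values $x_{kj} \in R$ (in
our notation $j \in S_2$), are normalised as follows:

for $x_{0j}^{S_2} > 0$:

$$z_{kj} = \begin{cases}
\dfrac{x_{kj}}{\max_k\{x_{kj}\}} & \text{for } x_{kj} \ge x_{0j}^{S_2} \\
-1 & \text{for } x_{kj} < x_{0j}^{S_2}
\end{cases} \qquad (3)$$

$$z_{kj} \in \left[\frac{x_{0j}^{S_2}}{\max_k\{x_{kj}\}};\; 1\right] \qquad (4)$$

for $x_{0j}^{S_2} = 0$:

$$z_{kj} = \begin{cases}
\dfrac{x_{kj}}{\max_k\{x_{kj}\}} & \text{for } x_{kj} \ge x_{0j}^{S_2} \\
-1 & \text{for } x_{kj} < x_{0j}^{S_2}
\end{cases} \qquad (5)$$

$$z_{kj} \in \left[\frac{\min_k\{x_{kj}\}}{\max_k\{x_{kj}\}};\; 1\right] \qquad (6)$$

DESTIMULI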

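1. The variable values which have destimuli character of type $D_1$, i.e.
without veto threshold value, with measure values $x_{kj} \in R_+$ (in our
notation $j \in D_1$), are normalised as follows:

$$z_{kj} = \frac{\min_k\{x_{kj}\}}{x_{kj}} \qquad (7)$$

$$z_{kj} \in \left[\frac{\min_k\{x_{kj}\}}{\max_k\{x_{kj}\}};\; 1\right] \qquad (8)$$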

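2. The variable values which have destimuli character of type $D_2$, i.e.
with veto threshold value $x_{0j}^{D_2}$, with measure values
$x_{kj} \in R_+$ (in our notation $j \in D_2$), are normalised as follows:

$$z_{kj} = \begin{cases}
\dfrac{\min_k\{x_{kj}\}}{x_{kj}} & \text{for } x_{kj} \le x_{0j}^{D_2} \\
-1 & \text{for } x_{kj} > x_{0j}^{D_2}
\end{cases} \qquad (9)$$

$$z_{kj} \in \left[\frac{\min_k\{x_{kj}\}}{x_{0j}^{D_2}};\; 1\right] \qquad (10)$$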
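
The stimuli and destimuli rules (1) - (10) can be illustrated with a short
Python sketch. It is only a minimal reading of the formulas, not the authors'
implementation: it assumes positive indicator values, divides stimuli by the
best (largest) observed value and destimuli into the best (smallest) observed
value, and marks every bank that breaks a veto threshold with -1. The function
names and the sample data are hypothetical.

```python
from typing import List, Optional, Sequence

VETO_MARK = -1.0  # score given to a bank that breaks a veto threshold


def normalise_stimulus(x: Sequence[float], veto: Optional[float] = None) -> List[float]:
    """Larger is better: x_kj / max_k x_kj, or -1 below the veto threshold."""
    top = max(x)
    return [VETO_MARK if veto is not None and v < veto else v / top for v in x]


def normalise_destimulus(x: Sequence[float], veto: Optional[float] = None) -> List[float]:
    """Smaller is better: min_k x_kj / x_kj, or -1 above the veto threshold."""
    low = min(x)
    return [VETO_MARK if veto is not None and v > veto else low / v for v in x]


# Hypothetical data for three banks (not taken from the paper).
solvency = [0.12, 0.08, 0.06]     # stimulus with veto threshold 8%
bad_credits = [0.02, 0.05, 0.09]  # destimulus with veto threshold 5%

print(normalise_stimulus(solvency, veto=0.08))
print(normalise_destimulus(bad_credits, veto=0.05))
```

In both lists the third bank breaks the threshold and is marked with -1,
while the best bank receives the value one.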
NOMINANTS



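1. The variable values which have nominant character of type $N_1$, i.e. with
recommended nominal value $x_{0j}^{N_1}$, with measure values $x_{kj}$ (in our
notation $j \in N_1$), are normalised as follows:

$$z_{kj} = \begin{cases}
1 & \text{for } x_{kj} = x_{0j}^{N_1} \\
\dfrac{x_{kj}}{x_{0j}^{N_1}} - 1 & \text{for } x_{kj} < x_{0j}^{N_1} \\
\dfrac{x_{0j}^{N_1}}{x_{kj}} - 1 & \text{for } x_{kj} > x_{0j}^{N_1}
\end{cases} \qquad (11)$$

$$z_{kj} \in \left[\min\left\{\frac{\min_k\{x_{kj}\}}{x_{0j}^{N_1}} - 1;\; \frac{x_{0j}^{N_1}}{\max_k\{x_{kj}\}} - 1\right\};\; 1\right] \qquad (12)$$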



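2. The variable values which have nominant character of type $N_2$, i.e. with
recommended interval of values whose ends are defined by the veto threshold
values $x_{0j}^{N_2^1}$ and $x_{0j}^{N_2^2}$, with measure values $x_{kj}$ (in
our notation $j \in N_2$), are normalised as follows:

$$z_{kj} = \begin{cases}
1 & \text{for } x_{0j}^{N_2^1} \le x_{kj} \le x_{0j}^{N_2^2} \\
\dfrac{x_{kj}}{x_{0j}^{N_2^1}} - 1 & \text{for } x_{kj} < x_{0j}^{N_2^1} \\
\dfrac{x_{0j}^{N_2^2}}{x_{kj}} - 1 & \text{for } x_{kj} > x_{0j}^{N_2^2}
\end{cases} \qquad (13)$$

$$z_{kj} \in \left[\min\left\{\frac{\min_k\{x_{kj}\}}{x_{0j}^{N_2^1}} - 1;\; \frac{x_{0j}^{N_2^2}}{\max_k\{x_{kj}\}} - 1\right\};\; 1\right] \qquad (14)$$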



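3. The variable values which have nominant character of type $N_3$, i.e. with
recommended nominal value $x_{0j}^{N_3}$ and acceptable value interval whose
ends are defined by the veto threshold values $x_{0j}^{N_3^1}$ and
$x_{0j}^{N_3^2}$, with measure values $x_{kj}$ (in our notation $j \in N_3$),
are normalised as follows:

$$z_{kj} = \begin{cases}
1 & \text{for } x_{kj} = x_{0j}^{N_3} \\
\dfrac{x_{kj}}{x_{0j}^{N_3}} & \text{for } x_{0j}^{N_3^1} \le x_{kj} < x_{0j}^{N_3} \\
\dfrac{x_{0j}^{N_3}}{x_{kj}} & \text{for } x_{0j}^{N_3} < x_{kj} \le x_{0j}^{N_3^2} \\
\dfrac{x_{kj}}{x_{0j}^{N_3^1}} - 1 & \text{for } x_{kj} < x_{0j}^{N_3^1} \\
\dfrac{x_{0j}^{N_3^2}}{x_{kj}} - 1 & \text{for } x_{kj} > x_{0j}^{N_3^2}
\end{cases} \qquad (15)$$

$$z_{kj} \in \left[\min\left\{\frac{\min_k\{x_{kj}\}}{x_{0j}^{N_3^1}} - 1;\; \frac{x_{0j}^{N_3^2}}{\max_k\{x_{kj}\}} - 1\right\};\; 1\right] \qquad (16)$$

Interpretation of the normalised values of the measures is intuitive and
straightforward. Values close to one can be read as a high ranking of the
bank with respect to the given indicator. A value close to the lower limit of
the interval denotes a considerably worse ranking. The lower limits are given
in the following formulas: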



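$$\begin{aligned}
z_{kj} &\in \left[\frac{\min_k\{x_{kj}\}}{\max_k\{x_{kj}\}};\; 1\right] && \text{for } j \in S_1 && (17) \\
z_{kj} &\in \left[\frac{x_{0j}^{S_2}}{\max_k\{x_{kj}\}};\; 1\right] && \text{for } j \in S_2 \text{ and } x_{0j}^{S_2} > 0 && (18) \\
z_{kj} &\in \left[\frac{\min_k\{x_{kj}\}}{\max_k\{x_{kj}\}};\; 1\right] && \text{for } j \in S_2 \text{ and } x_{0j}^{S_2} = 0 && (19) \\
z_{kj} &\in \left[\frac{\min_k\{x_{kj}\}}{\max_k\{x_{kj}\}};\; 1\right] && \text{for } j \in D_1 && (20) \\
z_{kj} &\in \left[\frac{\min_k\{x_{kj}\}}{x_{0j}^{D_2}};\; 1\right] && \text{for } j \in D_2 && (21) \\
z_{kj} &\in \left[\min\left\{\frac{\min_k\{x_{kj}\}}{x_{0j}^{N_1}} - 1;\; \frac{x_{0j}^{N_1}}{\max_k\{x_{kj}\}} - 1\right\};\; 1\right] && \text{for } j \in N_1 && (22) \\
z_{kj} &\in \left[\min\left\{\frac{\min_k\{x_{kj}\}}{x_{0j}^{N_2^1}} - 1;\; \frac{x_{0j}^{N_2^2}}{\max_k\{x_{kj}\}} - 1\right\};\; 1\right] && \text{for } j \in N_2 && (23) \\
z_{kj} &\in \left[\min\left\{\frac{\min_k\{x_{kj}\}}{x_{0j}^{N_3^1}} - 1;\; \frac{x_{0j}^{N_3^2}}{\max_k\{x_{kj}\}} - 1\right\};\; 1\right] && \text{for } j \in N_3 && (24)
\end{aligned}$$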
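
The three nominant rules (11), (13) and (15) can be sketched in the same
spirit. This is an illustrative reading only: it assumes positive values, a
score of one at the nominal value (or anywhere inside the recommended interval
for type $N_2$), a ratio to the nominal value inside the acceptable interval
of type $N_3$, and a negative score "ratio minus one" once a value lies off
the nominal value ($N_1$) or outside the veto thresholds ($N_2$, $N_3$). The
function name `nominant_score` and its parameters are hypothetical.

```python
from typing import Optional


def nominant_score(x: float, nominal: Optional[float] = None,
                   low: Optional[float] = None, high: Optional[float] = None) -> float:
    """Score one positive indicator value.

    nominal only      -> type N1
    low and high only -> type N2
    all three         -> type N3
    """
    if low is not None and x < low:      # below the lower veto threshold
        return x / low - 1.0
    if high is not None and x > high:    # above the upper veto threshold
        return high / x - 1.0
    if nominal is None:                  # N2: the whole interval is ideal
        return 1.0
    ratio = min(x, nominal) / max(x, nominal)  # equals 1 exactly at the nominal value
    if low is None and high is None:     # N1: anything off the nominal value is penalised
        return 1.0 if x == nominal else ratio - 1.0
    return ratio                         # N3: inside the acceptable interval


# Current liquidity as a type N2 nominant with interval [1.0, 1.25]
print(nominant_score(1.10, low=1.0, high=1.25))
print(nominant_score(0.90, low=1.0, high=1.25))

# Deposits to credits as a type N3 nominant: optimum 1, acceptable 0.8 to 1.6
print(nominant_score(1.20, nominal=1.0, low=0.8, high=1.6))
print(nominant_score(2.00, nominal=1.0, low=0.8, high=1.6))
```

Values inside the recommended range stay in (0, 1], while every breach of a
threshold yields a negative score, which is what later pulls the aggregate
measure of such a bank down.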



The proposed normalisation of variables is based on the idea of the
transborder reference system, which corresponds to the indications of Dale
Blickenstaff, adviser to the Warsaw Banking Institute, according to whom a
good bank should fulfil the key bank indicators. The list of key indicators
is presented in table 3 (the list is based on the proposals given in
Patterson's work (see [7], p. 35)).

     Measure                                 Assumed value of veto

1.   operating assets / assets               > 90%

2.   margin of income                        > 5.0%

3.   overheads / total operating assets      < 2.0%

4.   bad credits / operating assets          < 1.0%

Table 1: Examples of minimal value of indicators (Source: Based on the
paper [6])



4 Construction of the Aggregate Measure (Indicator)

Construction of an aggregate measure means the construction of a synthetic
model for a certain set of banks or for one bank with a multibranch
structure. The aggregate measure allowing bank evaluation is built as the
"average" of the normalised values of the evaluation variables. The measure
is given by the following formula:

$$S_k = \frac{1}{m}\sum_{j=1}^{m} z_{kj}, \quad \text{for } j \in S_1, S_2, D_1, D_2, N_1, N_2, N_3, \qquad (25)$$

where $z_{kj}$ is the value of feature $j$ for bank $k$, normalised according
to the appropriate formula (1), (3), (5), (7), (9), (11), (13), (15).
Normalisation performed using the formulas (17) - (24) has the common
property that the upper limit of the aggregate measure equals one, i.e.:

$$S_k \le 1. \qquad (26)$$



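We have to define the minimal value of the bank evaluation measure which can
still be considered positive, i.e. the veto point for measure (25), or the
threshold of minimal satisfaction. To achieve that we introduce the following
aggregate measure:

$$S_0 = \frac{1}{m}\sum_{j=1}^{m} z_{0j}, \qquad (27)$$

where

$$z_{0j} = \begin{cases}
\min_k z_{kj} \ \text{ or } \ \dfrac{\min_k\{x_{kj}\}}{\max_k\{x_{kj}\}} & \text{for } j \in S_1 \\
\dfrac{x_{0j}^{S_2}}{\max_k\{x_{kj}\}} & \text{for } j \in S_2 \\
\min_k z_{kj} \ \text{ or } \ \dfrac{\min_k\{x_{kj}\}}{\max_k\{x_{kj}\}} & \text{for } j \in D_1 \\
\dfrac{\min_k\{x_{kj}\}}{x_{0j}^{D_2}} & \text{for } j \in D_2 \\
1 & \text{for } j \in N_1 \\
1 & \text{for } j \in N_2 \\
\dfrac{x_{0j}^{N_3^1}}{x_{0j}^{N_3}} & \text{for } j \in N_3 \text{ and } \dfrac{x_{0j}^{N_3^1}}{x_{0j}^{N_3}} \le \dfrac{x_{0j}^{N_3}}{x_{0j}^{N_3^2}} \\
\dfrac{x_{0j}^{N_3}}{x_{0j}^{N_3^2}} & \text{for } j \in N_3 \text{ and } \dfrac{x_{0j}^{N_3^1}}{x_{0j}^{N_3}} > \dfrac{x_{0j}^{N_3}}{x_{0j}^{N_3^2}}
\end{cases} \qquad (28)$$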



These statements allow us to say that bank $P_k$ ($k = 1, \ldots, K$) will
receive:

1. Positive assessment, if

$$S_k > S_0 \qquad (29)$$

2. Minimum satisfaction assessment, if

$$S_k = S_0 \qquad (30)$$

where $S_0$ is given by (27) and (28). However, measure (25) does not fulfil
the usual normalisation conditions normally attributed to aggregate
measures. This disadvantage is eased by the existence of an upper limit equal
to one. Moreover, the restrictive normalisation evidently limits the
possibility that an object obtains a high ranking merely through averaging
its lowest and highest measure values. The proposed principles of
normalisation therefore considerably weaken the objection that additive
measures are not robust to extreme values (the worst and the best).
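
The assessment rule based on (25), (27), (29) and (30) amounts to comparing
two arithmetic means. A minimal sketch, assuming that the normalised values
$z_{kj}$ of a bank and the minimum-satisfaction values $z_{0j}$ are already
available as lists of equal length; the numbers are hypothetical and the
label "below minimum" is our own wording for the remaining case:

```python
from typing import Sequence


def aggregate(z: Sequence[float]) -> float:
    """Arithmetic mean of normalised values, as in (25) and (27)."""
    return sum(z) / len(z)


def assess(z_bank: Sequence[float], z_min: Sequence[float]) -> str:
    """Compare a bank's aggregate S_k with the minimum-satisfaction level S_0."""
    s_k, s_0 = aggregate(z_bank), aggregate(z_min)
    if s_k > s_0:
        return "positive"
    if s_k == s_0:
        return "minimum satisfaction"
    return "below minimum"


z_min = [0.5, 0.5, 1.0]                  # hypothetical z_0j for three indicators
print(assess([1.0, 0.75, 1.0], z_min))   # all thresholds met
print(assess([1.0, -1.0, 1.0], z_min))   # one veto threshold broken
print(assess(z_min, z_min))              # the border bank itself
```

The second bank shows the intended effect of the restrictive normalisation:
two best-possible values cannot compensate for a single broken veto
threshold.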

The proposed system of normalisation introduces certain norms which indicate
the range of the evaluation measure. As can easily be noticed, formulas (25)
and (27) - (30) allow the construction of "border banks", one of which lies
at the minimum satisfaction level and the other at the maximum level of
evaluation. The values assumed for the variables would be defined for those
banks by the formula:

$$P^{\max}: \left[y_{01}^{\max}, y_{02}^{\max}, \ldots, y_{0j}^{\max}, \ldots, y_{0m}^{\max}\right], \qquad (31)$$

$$y_{0j}^{\max} = \begin{cases}
\max_k\{x_{kj}\} & \text{for } j \in S_1 \\
\max_k\{x_{kj}\} & \text{for } j \in S_2 \\
\min_k\{x_{kj}\} & \text{for } j \in D_1 \\
\min_k\{x_{kj}\} & \text{for } j \in D_2 \\
\operatorname{nom}_k\{x_{kj}\} & \text{for } j \in N_1 \text{ and } j \in N_3 \\
x_{kj} \in \left[x_{0j}^{N_2^1},\, x_{0j}^{N_2^2}\right] & \text{for } j \in N_2
\end{cases} \qquad (32)$$

Obviously "the best" border bank possesses normalised values of all measures
equal to one:

$$z_{kj} = 1 \quad \text{for } j = 1, \ldots, m. \qquad (33)$$

The bank of minimum satisfaction has the measure values indicated by formula
(25). The minimal satisfaction point of bank evaluation classifies the banks
(branches) into two groups:

Group I - banks which exceed the veto threshold of the aggregate measure and
thus fulfil the minimum satisfaction of evaluation. Group I will include
banks for which:

$$S_k \ge S_0, \qquad (34)$$

where $S_k \le 1$, so it means:

$$S_k \in [S_0, 1], \qquad (35)$$

where $S_0$ is given by (27) and (28).

Group I could undergo further classification into average, good and very good
banks. This classification requires veto points of the aggregate measure
which set the limits of the proposed classes. We have to highlight the fact
that setting a veto point is an extremely important and responsible task.

Group II - banks which do not fulfil the minimum evaluation satisfaction.
Group II will comprise banks for which:

$$S_k < S_0. \qquad (36)$$

International norms, such as Cooke's ratio set at the level of 8%, can be
considered equivalents of veto points. Not all measures possess veto points,
which has been taken into account in the classification.



Abstract: In spite of its overwhelming success in theory and practice, the
shortcomings of the neo-classical approach in finance are well known. It
cannot explain many important phenomena, so that serious doubts about the
positive results it generates arise. Consequently interest shifts towards the
neo-institutional framework, but it is quite difficult to judge the relevance
of its results, since they are very sensitive to the assumptions and details
of the models. For instance, solution mechanisms for (financial) problems
should be "incentive compatible". This describes an idea, not an exact
definition. There exists a variety of operationalizations, and due to the
mentioned lack of robustness some sort of classification of models turns out
to be an important task for future research.

1 Introduction; The Neo-institutional Approach - Its Reason and its Main
Problem

The starting point in analyzing the firm's financial sphere, historically as
well as commonly for didactical purposes, is the description and discussion
of capital procurement possibilities. Capital demand results from the
requirements of the firm's production sector. In order to check feasibility,
profitability, and efficiency this leads to the construction of financial
plans, to the derivation of liquidity rules or other rules of thumb, and so
on. But due to the complexity of the real world it is difficult to identify
the crucial aims and decision criteria and to separate important from less
important features.

Consequently the neo-classical theory relies on radical simplification.
Institutional details vanish, and the "pure" market and its mechanisms
remain. This leads to very clear-cut results. In judging investment decisions
the present value represents the central concept, where the appropriate
discount factors are determined by the market and preferences are influenced
solely by expected cash flows and their standard deviations. The world of
neo-classical theory is well-defined. The results are generated by a set of
strong assumptions: all market participants are rational and perfectly
informed, and every cash flow can be traded. In addition, the main results
require market equilibrium and the non-existence of taxes and transaction
costs.

The merits of this approach are enormous; it more or less shapes the common
way of thinking about financial affairs. It gives orientation in a complex
environment and, in addition, some concrete instructions on how to take
presumably good financing decisions.

Nevertheless, it is not difficult to criticize the neo-classical approach.
First of all, the assumptions are far away from reality. In itself this does
not refute neo-classical theory, so that a broad field for empirical
investigations of its validity arises. While there is no definite answer to
this, it is without doubt desirable to search for useful model modifications.

One such attempt is the neo-institutional approach. The neo-classical theory
shows the idealized working of market mechanisms, but capital markets and
financial transactions suffer from heavy imperfections. So one has to
reintroduce some details into the model and to see whether they cause severe
changes in the results and of what kind these changes are. These models
explicitly take into account that contract partners have, depending on the
assumed situation, very special sets of possible actions, because they
typically act on the basis of different knowledge about the relevant data
(asymmetric information), which results in severe control problems in most
contractual relationships.

In this setting observable institutional phenomena can be the reason for the
problems, but they can also be the result of trying to mitigate them.
Obviously there are a lot of information and control problems in practice.
Models dealing with more than very few of them simultaneously cannot be
handled, neither mathematically nor in stringent verbal argumentation.
Moreover, even if only one or very few aspects are treated, there is always a
need for further simplifying assumptions.

As a consequence this leads to a variety of different models, yielding
different, sometimes contradictory results, so that neo-institutional theory
has to face the reproach of some arbitrariness. We will discuss this drawback
using the example of incentive compatibility, a central concept of
neo-institutional theory. While the term's meaning is intuitively clear,
there are different ways to operationalize it and consequently there is no
unique definition in the literature. This paper wants to draw attention to
the importance of the question formulated in the title; unfortunately, it
will not give any answer. It rather suggests that there is a need for some
systematization or classification.



2 The Conception of Incentive Compatibility 

For the sake of simplicity we will speak about incentive compatibility in the
vocabulary of one of its most common variants. A so-called principal
delegates a decision or an activity to one or more so-called agent(s).
Because of asymmetric information and non-observable actions the principal
cannot assess the quality of the agent's activities. This constitutes a
problem, since agents can try to maximize their own utility, probably in
contradiction to the principal's interests. The task for the principal is to
find arrangements which induce the agents to activities leading to the best
reachable result for the principal (the second-best solution instead of the
first-best solution, which would be attainable if information and control
problems did not exist).

The principal agent framework is a paradigm which embraces more than the
standard interpretation of an employer and his employees. In a suitable
formulation it covers allocation mechanisms for private as well as public
goods. A financing decision can also be seen as a principal agent situation,
where the financier is the principal and the manager/owner is acting on his
behalf to ensure maximal profit from the invested capital.

Because of the restricted space we cannot give a comprehensive formal
description here, but some hints will help to clarify the exposition. Let
$a = (a_1, \ldots, a_n)$ denote the actions of the $n$ agents (where in
general $a_i$ is a vector and can describe parametrized activities,
signalling, transmission of information and so on) and $\theta$ a random
variable. Then the result $G(a, \theta)$ is a function of actions and
stochastic influences. In searching for incentive compatible solutions the
functional form of $G$ is exogenously given and a question of model
construction. $\theta$ and $G$ in general will be multi-dimensional, too.

If we ignore control activities, the main instrument for influencing the
agents' behavior is the reward $R = (R_1, \ldots, R_n)$. In principle, the
functions $R_i$ can depend on all commonly observable data such as total
gain, individual results or observable activities. This dependence is fixed
by the principal. We call it mechanism $M$; the resulting reward is denoted
by $R^M$.

Therefore the construction is hierarchical. Agent $i$ with utility function
$u_i$ reacts to every given mechanism $M$ by maximizing his expected utility
$E\,u_i(R^M)$ with respect to $a_i$. The principal anticipates this behavior
and fixes, as far as his information allows, the mechanism in order to ensure
the best possible result in his view. This mechanism, which leads to the
above mentioned second-best solution, is called incentive compatible.

So an incentive compatible mechanism is defined to be the solution of an
optimization problem. The aims of the principal constitute the goal function,
while the incentive constraints represent the individual utility
maximization of the agent(s).
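
The hierarchical structure described above can be sketched as a nested
optimisation. The following Python fragment is only an illustration under
strong assumptions that are ours and not part of the general definition: one
agent, a linear reward $r = f + s \cdot G$ with gain $G = e + \theta$, a
normally distributed $\theta$ with variance `SIGMA2`, constant absolute risk
aversion `ALPHA`, quadratic effort cost $e^2/2$ measured in money, and a
reservation level of zero for the agent. All names and parameter values are
hypothetical.

```python
ALPHA = 2.0      # agent's absolute risk aversion (hypothetical)
SIGMA2 = 0.5     # variance of the random influence (hypothetical)
GRID = [i / 100 for i in range(101)]   # candidate efforts and profit shares in [0, 1]


def agent_certainty_equivalent(e, s, f):
    """Agent's certainty equivalent under reward r = f + s * G, G = e + noise."""
    return f + s * e - 0.5 * e ** 2 - 0.5 * ALPHA * SIGMA2 * s ** 2


def agent_best_effort(s):
    """Lower level: the agent maximises his own certainty equivalent."""
    return max(GRID, key=lambda e: agent_certainty_equivalent(e, s, 0.0))


def principal_value(s):
    """Upper level: expected residual gain when the agent best-responds."""
    e = agent_best_effort(s)
    f = -agent_certainty_equivalent(e, s, 0.0)  # fixed pay that just meets the reservation level 0
    return e - (f + s * e)


s_star = max(GRID, key=principal_value)
e_star = agent_best_effort(s_star)
print(s_star, e_star)
```

Under these particular assumptions the grid search reproduces the known
closed-form share $s^* = 1/(1 + \alpha\sigma^2)$. Changing any of the
assumptions (the admissible reward functions, the effort cost, the
participation level) changes the problem that defines "incentive compatible",
which is the sensitivity discussed in the following sections.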

This nucleus is varied and extended in the different settings. Modifications 
refer to the above mentioned modelling of functional dependencies, but also 
to the introduction of additional constraints, for instance cooperation con- 
straints. The resulting vagueness of the term “incentive compatibility” is 
twofold. Differences caused by different application situations probably may 
be regarded as unavoidable and quite insignificant. More severe is the find- 
ing, that in analyzing one and the same application differing models are used 
and, moreover, sometimes produce contradictory results. 

Using three examples, we will discuss some of these variations in model
building. This will, of course, not be exhaustive. So we will not reason
about the a priori chosen type of mechanism (direct versus signalling, for
instance), the static approach (versus sequential analysis) or the assumption
of pure hidden action or moral hazard (the principal cannot observe actions)
without hidden information (the agent has more information than the
principal).



3 Incentive Compatibility and the Common Principal Agent Model

The basics of the principal agent model are as described above. In the
standard model (hidden action) the information of principal and agent is the
same. The gain depends on the random variable $\theta$ and the agent's effort
$e$, which the principal cannot observe. In order to maximize the residual
gain after the agent's payment, the principal applies incentive compatible
payment schemes, which are tied to the agent's success.

(A) Variation in additional constraints: Sometimes payment functions 
are restricted a priori to linear functions. This causes algebraic simplifica- 
tion, but it reduces the set of possible solutions and can therefore deteriorate 
the optimal solution. 

(B) Variation in considered variables: Some (German) literature^1, unlike
the classical principal agent literature^2, arrives at quite simple
results^3. It turns out that effort aversion, in fact one of the essentials
of principal agent theory, is not considered there. The results are those
known from Ross' (1974) predecessor of principal agent theory.

(C) Variation in technical details: In the general case the principal agent
model does not allow for explicit solutions, so that technical
simplifications have to take place. For instance, a quite general formulation
of the agent's expected utility is $E\,w(r(G(e)), e)$, where $e$ is the
agent's effort (his action), $G(e)$ the gain depending on $e$, $r(\cdot)$ the
payment function and $w$ the agent's utility, which is increasing in $r$ and
decreasing in $e$ (effort aversion). In some papers^4 this is simplified by
assuming additive separability, i.e. $w(r, e) = u(r) - L(e)$, where $u$ is a
usual one-dimensional utility function and $L$ describes the effort aversion.

Instead, some other papers^5, especially those dealing with the so-called
LEN-Model^6, assume that $L$ is a function measuring effort on the same scale
as monetary outcomes and that the agent's preferences can be described by a
(one-dimensional) exponential utility function. This means constant absolute
risk aversion $\alpha$ in combination with multiplicative separability:

$$w(r, e) = -\exp\{-\alpha (r - L(e))\} = -\exp\{-\alpha r\} \cdot \exp\{\alpha L(e)\}.$$

In combination with other assumptions (mainly normal distribution) this
implies additively separable certainty equivalents. Both frameworks provide
utility independence of monetary outcome and effort^7, but only in the first
case are changes in utility caused by variation of one attribute independent
of the other attribute's value. In both cases we get the same main results,
but it is not clear where these correspondences end. Definitely wrong, for
sure, is making both assumptions simultaneously^8.

^1 Cp., for ex., Laux and Schenk-Mathes (1992).

^2 Cp., for ex., Rees (1985).

^3 For instance, convexity of payment functions, if the agent is risk averse
and the principal risk neutral.

^4 Cp., for ex., Rees (1985).

^5 Cp. Grossman and Hart (1983), p. 11, for a first remark.

^6 Cp. Spremann (1987).

^7 Cp. Keeney and Raiffa (1976), p. 226.
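
The step from the exponential utility function to additively separable
certainty equivalents can be made explicit. The following is the standard
calculation, under the assumption (introduced here only for illustration)
that for a given effort $e$ the payment $r$ is normally distributed with mean
$\mu_r$ and variance $\sigma_r^2$:

```latex
\begin{aligned}
E\left[-\exp\{-\alpha\,(r - L(e))\}\right]
  &= -\exp\{\alpha L(e)\}\; E\left[\exp\{-\alpha r\}\right] \\
  &= -\exp\{\alpha L(e)\}\; \exp\left\{-\alpha \mu_r + \tfrac{1}{2}\alpha^2 \sigma_r^2\right\} \\
  &= -\exp\left\{-\alpha\left(\mu_r - \tfrac{\alpha}{2}\,\sigma_r^2 - L(e)\right)\right\},
\end{aligned}
\qquad\text{so that}\qquad
CE = \mu_r - \tfrac{\alpha}{2}\,\sigma_r^2 - L(e).
```

Although the utility function itself is multiplicatively separable in $r$ and
$e$, the certainty equivalent is a sum of an expected payment, a risk premium
and an effort cost, which is what makes this framework so tractable.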



4 Incentive Compatibility and Budgeting 

We consider capital budgeting in divisionalized firms. Because the division 
managers are better informed about their division’s business, they report 
potential gains (depending on the assigned capital) to the headquarters, 
which, resting on these reported functions, optimizes the allocation with 
respect to net total gain. 

As usual the managers should be motivated to strive for the best result in 
their divisions attainable with the assigned capital. In addition, they ought 
to give true reports in order to enable the headquarters to fix the overall 
optimum for capital allocation. 

For the sake of brevity we omit stochastic considerations. $b_i$ denotes the
capital budget for division $i$, $g_i(\cdot)$ the true and $m_i(\cdot)$ the
reported potential gains of division $i$, $c(\cdot)$ the capital cost
function and $b_i^*$ the assigned budget, where $(b_1^*, \ldots, b_n^*)$
maximizes

$$\sum_{i=1}^{n} m_i(b_i) - c\left(\sum_{i=1}^{n} b_i\right).$$

The literature^9 discusses three budgeting mechanisms. All of them rely on
variable payment for the managers $i = 1, \ldots, n$ and they differ in the
basis $P_i$ on which the payment (linearly) depends. For the Weitzman-Scheme
we have ($0 < t_{i,1} < s_i < t_{i,2}$)

$$P_i = \begin{cases}
s_i \cdot m_i(b_i^*) + t_{i,1} \cdot \left(g_i(b_i^*) - m_i(b_i^*)\right) & \text{if } g_i(b_i^*) \ge m_i(b_i^*) \\
s_i \cdot m_i(b_i^*) - t_{i,2} \cdot \left(m_i(b_i^*) - g_i(b_i^*)\right) & \text{if } g_i(b_i^*) < m_i(b_i^*),
\end{cases}$$

for Profit-Sharing

$$P_i = \sum_{j=1}^{n} g_j(b_j^*) - c\left(\sum_{j=1}^{n} b_j^*\right)$$

and for the Groves-Scheme

$$P_i = g_i(b_i^*) + \sum_{j \ne i} m_j(b_j^*) - c\left(\sum_{j=1}^{n} b_j^*\right).$$

^8 Cp., for ex., Laux (1990), Kiener (1990).

^9 Cp. references in Bamberg and Trost (1998).

All three proposals satisfy the first requirement, i.e. the payments rise
with the actual success of the respective division. If they also induce
truthful reporting, they are called incentive compatible. For details of the
following discussion see Bamberg and Trost (1998).
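
Why the Groves-Scheme rewards truthful reporting can be checked numerically.
The sketch below is a toy example and not taken from the cited literature: it
assumes two divisions with true gain functions $g_i(b) = a_i \sqrt{b}$, a
quadratic capital cost function, budgets restricted to a small grid, and a
headquarters that maximizes the reported net total gain by enumeration. All
names and numbers are hypothetical. Division 0 tries several scaled versions
of its true gain function as reports while division 1 reports truthfully.

```python
import itertools
import math

GRID = [i * 0.5 for i in range(9)]      # admissible budgets 0.0, 0.5, ..., 4.0


def cost(total):
    """Hypothetical convex capital cost function c(.)."""
    return 0.5 * total ** 2


def gain(a):
    """Gain function b -> a * sqrt(b) with productivity parameter a."""
    return lambda b: a * math.sqrt(b)


def allocate(reports):
    """Headquarters: budgets maximizing the reported net total gain."""
    return max(itertools.product(GRID, repeat=len(reports)),
               key=lambda bs: sum(m(b) for m, b in zip(reports, bs)) - cost(sum(bs)))


def groves_basis(i, true_gains, reports):
    """P_i = own true gain + reported gains of the others - capital cost."""
    bs = allocate(reports)
    others = sum(m(b) for j, (m, b) in enumerate(zip(reports, bs)) if j != i)
    return true_gains[i](bs[i]) + others - cost(sum(bs))


true_gains = [gain(2.0), gain(3.0)]
payoffs = {}
for factor in (0.5, 1.0, 1.5, 2.0):     # factor 1.0 is the truthful report
    reports = [gain(2.0 * factor), true_gains[1]]
    payoffs[factor] = groves_basis(0, true_gains, reports)
print(payoffs)
```

With a truthful report the headquarters maximizes exactly the expression that
defines $P_0$, so no other report can lead to a higher payment basis,
whatever the other division reports.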

(D) Variation in the principal’s goal function: In the above given 
definition the actual goal of the headquarters, the net total gain, is replaced 
by the requirement of truthfulness. Pfaff and Leuz (1995) show that the 
resulting information premium can cause a divergence between the two goals. 

(E) Variation in the translation of economic reality: In every modelling
process it has to be decided which aspects of the real problem should be
taken into consideration. One example is the budgeting process as a whole.
The Weitzman-Scheme has been designed without regard to the self-evident
fact that in general the planning should be based on the divisions' reports.
And indeed it turns out, in spite of some support for this "solution", that
the Weitzman-Scheme cannot solve the problem at all, because it sets strong
incentives for false reporting.

(F) Variation in the incentive constraints: Profit-Sharing as well as the
Groves-Scheme is a solution if Nash equilibrium is used as the optimality
criterion. Nevertheless, if we define optimality by the stronger and more
convincing requirement of dominant strategies, Profit-Sharing is no longer
feasible. But if cooperative (collusive) instead of non-cooperative behavior
of the agents is considered, Profit-Sharing can solve the problem and the
Groves-Scheme in general will not.^12

Two other aspects already mentioned in connection with the principal agent 
model play a role in the budgeting problem, too: The discussions about so- 
called controllability — i.e. the requirement that payments should depend 
only on self influenced factors — and clearness can be seen as the introduc- 
tion of additional (informal) restrictions (cp. (A)). Furthermore, the results 
vary to the disadvantage of Profit-Sharing, if effort aversion is considered 
(cp. (B)). 



^10 Cp. Arbeitskreis Finanzierung (1994).

^11 Sometimes the Weitzman Scheme is called Soviet Incentive Scheme,
indicating its application in the former Soviet Union.

^12 To make things a bit more complicated: If the capital cost function
$c(\cdot)$ is linear, the Groves Scheme is incentive compatible even if
collusion is considered (cp. (E)).



5 Incentive Compatibility and the Asset Substitution Problem

An instructive example to demonstrate the influence of what we called (E),
the translation of economic reality, is the asset substitution problem due to
Jensen and Meckling (1976): A manager-owner of a partially debt-financed
firm, whose liability does not extend to private assets, may be tempted to
realize more risky projects than were specified in the negotiations with the
creditor and would be optimal under pure equity financing. This effect is
caused by the asymmetric distribution of the manager-owner's revenues under
debt financing.

An institutional regulation is called incentive compatible if it removes this
incentive. It is quite plausible that securities may do the trick, and just
this result is derived by some formal models. A closer look shows, however,
that the crucial point is the definition of "increasing risk". For instance,
in the models of Green and Talmor (1986) and Bester and Hellwig (1987)
increasing risk is implemented by rising variance of the risky return
combined with declining expectation. If these constructions are replaced by
a very natural and seemingly appropriate one^13, namely the mean preserving
spread, the situation turns out to be very ambiguous. For a detailed
discussion and further references see Kürsten (1997).
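
The incentive itself is easy to see in a two-project example. The numbers
below are hypothetical and only illustrate the mechanism, not any of the
cited models: both projects have the same expected return, the second one is
a mean preserving spread of the first, and the owner's liability is limited
to the firm's assets, so the owner receives $\max(x - D, 0)$ and the creditor
$\min(x, D)$ for a return $x$ and a debt claim $D$.

```python
def split(returns, probs, debt):
    """Expected payoffs of owner and creditor under limited liability."""
    owner = sum(p * max(x - debt, 0.0) for x, p in zip(returns, probs))
    creditor = sum(p * min(x, debt) for x, p in zip(returns, probs))
    return owner, creditor


DEBT = 80.0
safe = split([100.0], [1.0], DEBT)                # certain return of 100
risky = split([50.0, 150.0], [0.5, 0.5], DEBT)    # same mean, wider spread

print("safe :", safe)
print("risky:", risky)
```

The owner's expected payoff rises when the riskier project is chosen while
the creditor's falls by the same amount, because the owner's payoff is a
convex function of the return. Whether a given institutional regulation
removes this incentive depends, as stated above, on how "increasing risk" is
modelled.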

6 Conclusion 

In consequence of the outlined problems we can argue for an important 
and challenging task for further research. The abridged argumentation is as 
follows: 

1. Incentive compatibility is an important guideline to think about the 
ingenuity of economic arrangements. 

2. There is no clear definition of the term “incentive compatibility”. 

3. A classification of sets of assumptions and, subsequently, of the results
obtained would be very desirable.^14

4. Subsequently an analysis could take place of whether some types of models
yield specific types of results.

5. This would enable a systematic discussion of the reliability of
neo-institutional models and help to mitigate the objection that
neo-institutional results are somewhat arbitrary.



^13 Appropriate, because the definition of risk aversion in expected utility
theory, which is the overall framework of these considerations, relies on
this definition.

^14 Presumably this attempt may fail in the case (E), here called translation
of economic reality.



Abstract: As soon as the automotive industry in the western world recognized
that the outstanding performance of Japanese manufacturers is the result of
a customer-oriented understanding of quality, the QFD approach became a
famous instrument for achieving the product quality demanded by customers.
Nevertheless this concept has its limits. In this article we suggest an
extension of the QFD approach on the basis of customer values and benefits.



1 Introduction 



The significance of product and service quality as a major competitive
success factor is undisputed. On hard-fought buyers' markets made up of
critical, demanding customers there is no alternative to consistent quality
orientation. Recently, however, the design of product quality has come to be
seen (above all by German manufacturers) not merely as the task of a single
functional unit, but as a central challenge for any company. This altered
perspective was brought about by the realization that superior products are
available in many branches of industry, in terms of both price/cost and
quality. This was accompanied by the recognition that the outstanding
performance of Japanese manufacturers in particular cannot be entirely
attributed to a higher, culturally founded level of employee commitment
combined with a lower level of aspiration. Rather, it is a customer-oriented
understanding of quality embracing all operational functions, such as the
quality function deployment concept (QFD), that is responsible for their
market success (see Engelhardt, Schütz (1991), Stauss (1994), Weber, Nippel
(1996)).

The provision of high-quality products and services can be divided into four
phases. In the first phase, the customer needs must be understood so that
potential solutions can be generated in the second phase. In the third phase
these solutions are translated into construction and design attributes. In the
final phase the product is offered to the customers. The aim of this paper is
to show the advantages of the QFD approach. We focus on the first and second
phases of the concept: the measurement of customer needs and the generation of
solutions for customer-oriented product development.

2 The “classical” Quality-Function-Deployment Approach

The Quality-Function-Deployment approach was initiated in Japan in the
mid-1960s. Quality-Function-Deployment can be described as an approach
to product quality design which attempts to translate the “voice of the
customer” into the “language of the engineer”. The core principle of this
concept is a systematic transformation of customer requirements and
expectations into measurable product and process parameters. From a
methodical point of view, it appears useful to subdivide a quality planning
process derived from customers’ expressed wishes into four separate phases (see
Hauser, Clausing (1988), Gustafsson, Johnson (1996)). The House of Quality
is concerned with translating the purchase-decision-relevant attributes of
a product, which have been established, for example, within the framework of a
conjoint study, into design features. These design features are subsequently
transformed into part features during the parts development phase. The aim
of the work preparation phase is then to define crucial operating procedures
on the basis of the specified part features. The crucial operating procedures
in turn serve to determine the production requirements in greater detail.

However, customers are typically only familiar with current products and
technologies, and asking customers what they want will not result in
new-to-the-world products (see Johnson (1997)). Furthermore, from a customer’s
perspective, a firm’s current products and services are just one possible means
to an end. Firms that focus on root needs and values can develop totally
new markets. Root needs in this context could be things such as “feeling good
after running”, “being stylish in a certain set of clothes”, and a “sense of
accomplishment after cooking”. Aspects like these are rarely discussed in the
Quality-Function-Deployment literature. They are, however, thoroughly
described in marketing theory.



3 The “market-oriented” Quality-Function-Deployment Approach

3.1 The Idea 

This approach suggests that the attributes of a product are crucial to a con- 
sumer’s assessment of its usefulness. It is based on Lancaster’s proposition 
whereby consumers perceive products as bundles of attributes (see Lancaster 
(1966), Herrmann (1992)). However, it is not the intrinsic (physical, chemi- 
cal or technical) commodity attributes that determine the quality judgement. 
On the contrary, the value attached to a commodity is dependent on extrinsic 
(immaterial or non-functional) attributes, such as the brand name and aesthetic
aspects. Moreover, behavioral science studies have documented that
the perception of product attributes by consumers - which is not necessarily
identical to objective reality - controls purchasing patterns (see Brockhoff 
(1993), Urban, Hauser (1993)). The (perceived) attributes thus represent 
the most suitable determinants for conceiving marketing activities. 

In addition, it seems reasonable to state that consumers do not purchase 
a package of attributes, but rather a complex of utility components. This 
idea appears plausible, as buyers are only rarely aware of all the beneficial 
attributes of a product. In many cases it is also true that several different
attributes jointly provide one concrete benefit and that one attribute can
affect different utility areas. Hence, there is normally no “one-to-one”
relationship between the physical/chemical/technical attributes and the utility
components (see Herrmann (1996a)). This supposition illustrates one of the
dilemmas of marketing policy: when a company develops a product, it is 
only able to decide the levels of this product’s intrinsic attributes. A con- 
sumer, on the other hand, bases his or her purchase decision on the utility 
conceptions derived from a perception of the product’s attributes. 

There is, however, cause to doubt that the utility expectations of the buyer 
do in fact constitute the “ultimate explanation” of purchasing behavior. On
the contrary, the motives for individual actions can be better accounted for 
by stimulating forces, such as a set of values and the formation of an inten- 
tion. Nevertheless, very little attention is paid to these hypothetical con- 
structs when specifying a company’s performance. There are two reasons for 
this: firstly, numerous studies have documented the unsatisfactory associa- 
tion between these variables and purchasing behavior. It is an undisputed 
fact that specific behavioral patterns cannot be predicted, for example, on 
the basis of specific motives (see Kroeber-Riel (1992)). Secondly, there is no 
theory in existence at the present time which describes the interaction be- 
tween the hypothetical constructs and the relevant utility components and 
commodity attributes. Consequently, no evidence is available to support the 
notion that marketing activities are conceived according to the stimulating 
forces of purchasing behavior. 

The various problems raised in this connection can be solved by extending
the quality function deployment approach from the point of view of marketing
theory (see Figure 1 and Herrmann et al. (1997)). First of all, the means-end
theory is used to integrate the set of values of a consumer (element 1) with
the utility components (element 2) and the attributes (element 3) of a product.
Those attributes which are relevant to purchasing behavior can then be
transformed into design features (element 4) and part features (element 5), as
well as operating procedures (element 6) and production requirements
(element 7), with the aid of the “traditional” quality function deployment
approach. Finally, an analysis of customer satisfaction provides information
regarding the extent to which the product (element 8), which has been developed
on the basis of the extended quality function deployment concept, corresponds
to the consumer’s conception of utility (element 9) and to his set of values
(element 10).

Figure 1: Extension of the quality function deployment approach from the
point of view of marketing theory

3.2 The “means-end” Theory

The means-end theory is based on the work of Tolman, who drew attention
to the goal-oriented nature of individual behavior as early as the 1930s.
The central hypothesis of this approach maintains that consumers consider
bundles of services to be instruments (means) for fulfilling desirable goals
(ends). The potential buyer recognizes which services are linked to which
values in the course of an information-processing procedure, which the
means-end method attempts to reveal (see Kroeber-Riel (1992)). During
the 1970s and 1980s Tolman’s work served as a basis for elaborating the
first “means-end” models. These approaches all share the objective of
linking a selected stimulating force (e.g. set of values, goals in life)
with the physical-chemical-technical commodity attributes relevant to the
conception of marketing activities. The “means-end” model developed in
the 1980s by Gutman and Reynolds can be considered a combination of
all previously known approaches (see Reynolds, Gutman (1988)). As figure
2 shows, the fundamental structure of this model consists of three elements,
namely attributes, utility components and set of values.







The attributes can be further subdivided according to their level of abstrac- 
tion (see figure 1). An attribute is concrete if its various levels describe the 
physical-chemical constitution of a product. An attribute can generally be 
observed directly or objectively, and often exhibits a finite number of discrete 
states. Whereas this type of attribute is normally only able to specify one 
facet of a complex, the term “abstract attribute” permits a comprehensive
description of a phenomenon. The level of this type of attribute depends on 
the subjective opinion of the individual, rather than on a series of objective 
facts. 

According to Vershofen’s utility theory, the initial functional utility of a
product is derived from its physical-chemical-technical attributes (see
Vershofen (1959)). The utility specifies the usefulness of the commodity as well
as embracing the consequences of the product’s actual usage. The socio- 
psychological utility, on the other hand, includes all aspects which are not 
vital to the actual function of the product (additional utility). 

Graumann and Willig have suggested that sets of values represent series of
individual standards - which remain constant over a period of time - serving
to formulate goals in life and to put them into practice in everyday behavior
(see Graumann, Willig (1983)). A set of values thus constitutes an explicit or
implicit conception of ideals, characteristic of the individual concerned,
which controls the choice of a particular mode, instrument (means) and goal
(end) of conduct. This view is also held by Rokeach, who defines a set of
values as “... an enduring belief that a specific mode of conduct or end-state
of existence is personally or socially preferable to an opposite or converse
mode of conduct or end-state of existence ...” (Rokeach (1973)). By an
“enduring belief in preferable modes of conduct and end-states of existence”,
the researcher does not simply mean a cognitive representation or conception
of possible plans of values (or the goal in life) over and above the cognitive
component. The idea stems from Rokeach’s definition of the term “value” as an
“... intervening variable that leads to action when activated ...”
(Rokeach (1973)).

It follows from this definition that a distinction needs to be drawn between
terminal (“end-states of existence”) and instrumental (“modes of conduct”)
values. The terminal values, which embody desirable goals in life, can be
further subdivided into two categories. The group of personal values includes,
amongst others, inner harmony and serenity in love, whereas global peace,
national security and a world full of beauty belong to the class of social
values. The instrumental values, which represent desirable forms of behavior,
consist of moral and achievement-oriented values (see Herrmann (1996)).
Whereas tolerance, a willingness to help and a sense of responsibility are
examples of moral values, the set of achievement-oriented values comprises
such attributes as logical, intellectual and imaginative.

A means-end chain, which represents a small part of the knowledge structure
of an individual, can be constructed using the specified means-end elements
(see figure 2). A consumer’s intention to purchase a product initially causes
the concrete attributes and the abstract attributes associated with it to
be activated. This impulse is then propagated via the functional utility
components and the socio-psychological utility components, before finally
reaching the instrumental values and the terminal values.



[Diagram: attributes, utility components, set of values]

Figure 2: The “means-end” chain
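
The activation order described above can be written down as a small data structure. The following is only an illustrative sketch: the level names follow the text, while the function `build_chain`, the class `Link` and all example labels (a hypothetical running shoe) are assumptions made for this illustration and do not come from the study.

```python
from dataclasses import dataclass

# Levels of the means-end chain in the order in which the text says
# they are activated, from concrete attributes to terminal values.
LEVELS = (
    "concrete attribute",
    "abstract attribute",
    "functional utility",
    "socio-psychological utility",
    "instrumental value",
    "terminal value",
)


@dataclass(frozen=True)
class Link:
    level: str
    label: str


def build_chain(labels):
    """Pair exactly one label with each level, in activation order."""
    if len(labels) != len(LEVELS):
        raise ValueError("expected exactly one label per level")
    return [Link(level, label) for level, label in zip(LEVELS, labels)]


# Hypothetical chain for a running shoe (labels are illustrative only).
chain = build_chain([
    "cushioned sole",
    "comfortable",
    "less strain while running",
    "feeling good after running",
    "sense of responsibility",
    "inner harmony",
])

for link in chain:
    print(f"{link.level:28s} -> {link.label}")
```

Reading the printed chain from top to bottom reproduces the propagation of the purchase impulse from the product attributes to the consumer's set of values.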



3.3 The Customer Satisfaction Concept 

According to this theory, a supplier is faced with the constant challenge of 
achieving maximum possible customer satisfaction. The significance of the 
satisfaction rating for assessing the quality of a product derives from its 
function as an indicator of actual purchasing behavior (see Simon, Homburg 
(1997), Fornell et al. (1996)). The customer’s (dis)satisfaction is the out- 
come of a complex information processing process, which essentially consists 
of a desired-actual comparison of a consumer’s experience with a purchased 
product or service (actual) and his expectations with regard to the fitness 
of this commodity or service for its intended purpose (desired) (see Lin- 
genfelder, Schneider (1991)). The congruence or divergence yielded by this 
comparison between the perceived product quality and the anticipated qual- 
ity is expressed as the consumer’s (dis)confirmation. Of the many attempted 
definitions, the one best suited for the purposes of this work is that put for- 
ward by Anderson: ”... consumer satisfaction is generally construed to be a 
postconsumption evaluation dependent on perceived quality or value, expec- 
tations and confirmation/disconfirmation - the degree (if any) of discrepancy 
between actual and expected quality ...” (Anderson (1994)). 

Whether or not an individual considers his expectations to be confirmed by 
a purchase, so that he is satisfied with the performance of the supplier, is 
primarily dependent on the perceived quality. Quality perception is directly 
linked to the purchase and consumption experience (see Figure 1), and can 
be defined as a consumer’s global judgement in relation to the fitness of a 
product for its intended purpose (Zeithaml 1988). The individual concerned 
assesses each of the purchased product’s attributes that are of relevance to 
him with regard to their suitability (element 8), and then integrates the 
partial ratings in accordance with a decision-making rule to obtain a quality 
judgement (element 9). The buyer’s expectations represent a specific level 
of quality that he hopes to find in the commodity. They serve as a yardstick 
for appraisal by the purchaser, which can be used to measure the consumed 
product or service. The level of expectation is determined firstly by previous 
consumption experiences, in other words by past encounters with the product
in question (Fornell 1992). Secondly - and this applies in particular to
situations in which a product is purchased and consumed for the first time - 
the consumer obtains, in addition to other preliminary information, an idea 
of the quality of the contemplated product, above all from the prices of the 
available alternatives. If the commodity in question matches the consumer’s 
conceptions in every respect, he will be satisfied with it (element 10). 
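
The desired-actual comparison can be illustrated with a few lines of code. This is a minimal sketch under stated assumptions: the text only says that partial ratings are integrated "in accordance with a decision-making rule", so the weighted additive rule used here is one possible choice, and the attribute names, the 0-10 scale, the weights and the expectation level are all hypothetical.

```python
def perceived_quality(ratings, weights):
    """Integrate partial attribute ratings with a weighted additive rule.

    The additive rule is an assumption; the text does not prescribe
    a specific decision-making rule.
    """
    total_weight = sum(weights[name] for name in ratings)
    return sum(ratings[name] * weights[name] for name in ratings) / total_weight


def disconfirmation(perceived, expected):
    """Positive: expectations exceeded; zero: confirmed; negative: not met."""
    return perceived - expected


# Hypothetical ratings (0-10 scale) and importance weights.
ratings = {"cleaning power": 8.0, "packaging": 6.0, "price": 4.0}
weights = {"cleaning power": 0.5, "packaging": 0.25, "price": 0.25}

quality = perceived_quality(ratings, weights)
gap = disconfirmation(quality, expected=7.0)
print(quality, gap)  # 6.5 -0.5
```

In this toy case the perceived quality falls short of the expected level, which corresponds to a negative disconfirmation and hence to dissatisfaction.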



4 Discussion 

It can thus be seen that the means-end theory, the customer satisfaction
concept and the QFD approach are not competing but rather complementary
approaches. The aim of customer satisfaction theory is to understand what
the key utilities are in order to create satisfied customers, while QFD
uses this information as a starting point for translating these utilities into
product characteristics and on to production requirements.

Within this framework it can be demonstrated that consumers purchase
a complex of utility components rather than bundles of attributes. The
means-end theory attempts to integrate individual sets of values with utility
components and product attributes.


Market Segmentation and Profiling
Using Artificial Neural Networks

Karl A. Krycha¹

Institut für Betriebswirtschaftslehre, Universität Wien, A-1210 Wien, Austria

Abstract: An increasing number of applications of Artificial Neural Networks
(ANN) in the field of Management Science demonstrates the growing relevance
of the modeling techniques summarized under this heading. This situation is also
reflected in marketing, where several papers published within the last few years
have focused on ANN. To evaluate their potential for market segmentation and
consumer profiling, different approaches, including connectionistic and
traditional models, are compared in several independent experiments on
a typical set of marketing data.



1 Applications of ANN 

Originally, research in the area of ANN was oriented towards representing
real biological systems. Due to the high complexity of these systems, only
heavily simplified artificial networks have been considered for computational
purposes so far. From an empirical point of view, applications of ANN in
management science are not as widespread as applications in technical
disciplines. In a sample of 90 papers published in the area of management
from 1990 to 1996 we found no more than one single application of
Self-Organizing Maps (SOM) in marketing and one in the field of production
(Krycha and Wagner (1999)). Based on the number of experiments documented in
this sample, the multilayer perceptron (MLP) has turned out to be the standard
architecture, used in 96% of all cases. In 81% of all cases the topology
applied is three-layered, consisting of one input, one hidden and one output
layer. The learning rule for training the network is the backpropagation
algorithm or a slightly adapted version of it in 88% of all cases (cf. table
1). Contrary to our expectations, the samples used in these applications are
rather moderate in size (no more than 50 observations for one experiment in the
domain of marketing). The figures given in table 1 for the minimum, maximum and
mean number of observations therefore support the applicability of neural
network based modeling approaches to small problems as well. Traditional
methods used as a benchmark against ANN are listed in the table in descending
order of their frequency of use. For reasons of simplicity and clarity we used
the total number of learning patterns plus holdout patterns when computing
the minimum, maximum, and average number of observations used

¹Financial support for this project was given by the Oesterreichische Nationalbank
(Jubiläumsfondsprojekt 6215). The author further gratefully acknowledges the support
of Dr. Artur Baldauf who provided the data set used in this study.

                       marketing        finance          production       total

ANN application
  classification       50%              87%              100%             54.2%
  market response      50%              -                -                25%
  optimization         -                -                -                16.6%
  forecasting          -                13%              -                4.2%
  traditional          DA, MRA,         DA, MRA, MDM     DP, TSP,
  method used          LR, MCI, NBD                      Wagner-Whitin

ANN topology
  MLP                  97% (33/34)      95.5% (42/44)    90% (9/10)       95.5% (84/88)
  three-layered MLP    85.3% (29/34)    77.3% (34/44)    80% (8/10)       80.7% (71/88)
  SOM                  3% (1/34)        -                10% (1/10)       2.3% (2/88)

ANN learning
  min sample size      50               56               25               25
  max sample size      4567             228              2200             4567
  mean sample size     795.9            103.8            956.5            543.2
  learning by BP       94% (32/34)      84% (37/44)      80% (8/10)       87.5% (77/88)

  DA   discriminant analysis            NBD  negative binomial model
  MRA  multiple regression analysis     DP   dynamic programming
  LR   logistic regression              MDM  Mahalanobis distance measure
  MCI  multiplicative competitive            (clustering) approach
       interaction model                TSP  travelling salesman problem

Table 1: Typical patterns of ANN-applications in management science



in the experiments. From a problem-oriented point of view, ANN applications
in the sample are encountered when dealing with classification, market
response, optimization, or forecasting tasks, with a dominant use for
classification. In marketing, connectionistic models are used for classification
and market response modeling only (e.g., Hruschka (1993), Wartenberg and
Decker (1996)). The term classification subsumes both neural network based
clustering approaches (e.g., Reutterer (1997)) and the prediction of group
membership (e.g., Mazanec (1993)). From a marketing practitioner’s point
of view, neural networks are attractive alternatives to a wide range of
different models. In fact, even in cases where their traditional counterparts
perform better, this remains a strong argument in favour of connectionistic
approaches, as the overall modeling structure remains essentially the
same and the underlying functionality is easy to communicate. A substantial
argument in favour of connectionistic models is their close relationship to
the SOR modeling paradigm, which plays an important role as a conceptual
framework in marketing decision making (Wartenberg and Decker (1996)).
In order to further evaluate the performance and suitability of ANN applied to
marketing problems, this contribution presents a comparison of MLP and
SOM with their statistical counterparts.



2 Target Marketing 

A typical problem in Marketing Management deals with the structuring of
heterogeneous markets in order to identify the parts of the market that a
company can serve best. This process is referred to as target marketing and
consists of identifying market segments (i.e., segmentation), the subsequent
characterization of the resulting classes of consumers (i.e., profiling), and
finally the selection of one or more of them in order to develop products and
marketing mixes tailored to each. In performing the first step of target
marketing, a priori segmentation soon becomes inadequate as the market becomes
more sophisticated (Kotler et al. (1996)). A very powerful approach in such
situations for many consumer markets is psychographic or behavioural
segmentation. Attitudes towards a product measured by perceived product
attributes have a close link to purchase intentions and can therefore be used
as variables for building market segments. Statistical methods utilized in this
context are, e.g., cluster analysis for segmentation and multiple regression
analysis (MRA) or factor analysis (FA) for extracting meaningful consumer
profiles. Discriminant analysis (DA), on the other hand, can be used for
classification of new consumers on the market.

2.1 Data Set Used 

In 1993 the Austrian detergent market was characterized by a total market
volume of about 270 million Austrian Schilling a year at purchase price.
The leading brand on the market at that time held a market share of about
66%, whereas its biggest competitor accounted for about 31% (ACNielsen).
To gather additional information for the improvement of its product, the
management of SOMAT SUPRA decided to define a marketing research
project concentrating on some particular aspects of the market. A
questionnaire was designed to test several hypotheses concentrating, among
others, on the unique selling proposition of SOMAT SUPRA. The data collection
took place in supermarkets and superstores located in out-of-town megamalls,
neighborhood shopping centres and city-centre shopping zones and
resulted in a sample of 101 responses (Grasserbauer et al. (1994)). This
conceptual framework serves as a baseline for the research procedure developed
to assess the relative performance of traditional vs. connectionistic
approaches. For our further investigations we selected a subset of six items
of the questionnaire. In order to measure perceived product attributes of
SOMAT SUPRA, the respondent is asked to complete these questions on a
seven-point semantic differential scale coded in the interval from 0 to 7 (cf.
figure 1). In the course of the following investigations answers on this scale
are assumed to be near-interval scaled (cf. Churchill (1995)).

2.2 Research Procedure 

For segmentation, a priori clusters are compared to results formed by a
SOM-type ANN and to results formed by Ward’s error sum-of-squares method
of clustering. For each clustering method several solutions with respect to
the number of clusters are compared.

For profiling, a standard multilayer perceptron (MLP) neural network
architecture is compared in depth to the results of DA. Explored here is the
potential of artificial neural networks to solve classification problems;
typically ANN perform well in this problem domain and are reported to achieve
higher hit ratios on holdout samples than the methodologies they are compared
with. Our comparison takes into account the influence of variations in the
clustering method used for segmentation, variations in the number of clusters
considered, and variations in the architecture of the MLP used. In contrast to
ANN, some statistical methods may react very sensitively to the existence of
missing values. A rigorous cleaning of the collected data avoids this
kind of problem and ensures the comparability of the final results. Given
the cross-sectional nature of the data, a subset of 16 cases is deleted due to
missing values among the variables of interest, as well as another set of 10
observations that apparently contained outliers.

“Please rate the following product attributes of SOMAT SUPRA and of CALGONIT
ULTRA using the printed scales (a package of SOMAT SUPRA and a package of
CALGONIT ULTRA is shown to the respondent):”

  (0)                                          (7)
  The package used is handy                    The package used is bulky
  The package used is ecologically beneficial  The package used is degrading the environment
  The package used is decent                   The package used is flashy
  The package used protects the product well   The package used does not protect the product well
  The package used is appealing                The package used is not appealing
  The product is easily available              The product is not easily available

Figure 1: Selected items of questionnaire

An assessment of the model performance by the simple resubstitution method
uses the same set of data both for estimating the model parameters and for
assessing the model performance. This approach leads to grossly biased results:
the smaller the sample size of the underlying set of data, the more extreme
this over-optimism is likely to be. A more reliable evaluation is attained by
randomly splitting the set of observations into two portions and then using
one portion to fit the model and the other portion to assess its performance.
At the other extreme, the method of cross-validation (also known
as the leave-one-out method) consists of estimating the model using
the sample data minus one observation and subsequently using
this model to classify the omitted single observation. This procedure is
repeated by omitting each of the individuals in turn (Krzanowski (1988)).
The chief drawback of cross-validation is its heavy computational demand,
especially in the context of ANN, as a new model must be calculated
each time a new individual is omitted from the training set. As a
trade-off between these two extremes, in the present investigation a jackknife
procedure resamples, for each combination of cluster method and cluster number
considered, the 75 observations remaining after data cleaning into five
subsamples labeled subsample A to E. In total 35 subsets of data are
prepared, each splitting up the available observations into a portion of 60
data points for training plus the corresponding validation portion comprising
15 data points. The average hit rate across subsamples A to E is
used for evaluation of the different settings (cf. table 3).
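
The 60/15 resampling scheme can be sketched as follows. This is a hedged illustration, not the procedure actually used in the study: the text does not state how the five subsamples were drawn, so the sketch assumes that the 15-point validation portions are disjoint and obtained from one random shuffle; the function names and the seed are hypothetical.

```python
import random


def five_subsamples(n_obs=75, n_splits=5, seed=0):
    """Split n_obs observation indices into labeled train/validation pairs.

    Assumption: validation portions are disjoint blocks of a single
    random permutation (75 = 5 x 15), training uses the remaining 60.
    """
    indices = list(range(n_obs))
    random.Random(seed).shuffle(indices)
    size = n_obs // n_splits
    splits = {}
    for k, label in enumerate("ABCDE"[:n_splits]):
        validation = indices[k * size:(k + 1) * size]
        held_out = set(validation)
        training = [i for i in indices if i not in held_out]
        splits[label] = (training, validation)
    return splits


def average_hit_rate(hit_rates):
    """Mean hit rate across the subsamples of one setting."""
    return sum(hit_rates) / len(hit_rates)


splits = five_subsamples()
for label, (training, validation) in splits.items():
    print(label, len(training), len(validation))  # each line: 60 and 15
```

With seven combinations of cluster method and cluster number, repeating this split once per combination yields the 35 subsets mentioned above; compared with leave-one-out, only five models instead of 75 have to be trained per setting.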

2.3 Segmentation Results 

The formation of a priori segments depends on the individually stated
preferences for one of several possible alternatives of a closed-ended question.
For observations where several brands are checked by the consumer, the
classification is based on the brand predominantly chosen, as collected in a
separate but analogous question. Accordingly, the sample is split up into 17
consumers most frequently buying detergents of the SOMAT product line vs. 58
consumers usually buying other products. For the three-group case the latter
group is broken down into 46 buyers of CALGONIT products and 12 buyers of other
detergents.

For the ANN-based clustering of the data we trained four different SOM
containing 2x3, 4x6, 8x12, and 32x48 codebook vectors. This type of
unsupervised learning neural network defines a nonlinear projection from the
input data space $\Re^n$ onto a regular two-dimensional array of nodes. With
every node $i$ in the mapping, a parametric reference vector $m_i \in \Re^n$ is
associated. In cases where the map consists of more prototypes than there are
observations in the investigated data set, subsets of the prototypes will group
together to approximate the detected clusters. These subsets are used for a
better visualization of the cluster landscape and to estimate cluster centroids.
The lattice type of the array can be defined to be rectangular or hexagonal. In
the course of learning an input vector $x \in \Re^n$ is presented to the network
and compared to its codebook vectors $m_i$. The best match is defined as the
response and the input is mapped onto this location. In practical applications
the smallest of the Euclidean distances is made to define the best matching
node, signified by the subscript $c$:

$$c = \arg\min_i \{ \| x - m_i \| \}.$$
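
The search for the best matching node can be sketched in a few lines. The example is illustrative only: the function name `best_matching_unit` and the three six-dimensional codebook vectors are made up for this sketch and are not taken from the trained maps.

```python
import math


def best_matching_unit(x, codebook):
    """Return the index c of the codebook vector closest to x (Euclidean)."""
    return min(range(len(codebook)), key=lambda i: math.dist(x, codebook[i]))


# Hypothetical codebook with three reference vectors for the six items.
codebook = [
    [1.0, 1.0, 4.0, 1.0, 2.0, 1.0],
    [2.0, 1.0, 4.0, 5.0, 3.0, 1.0],
    [5.0, 1.0, 3.0, 2.0, 4.0, 1.0],
]
x = [6.0, 1.0, 2.0, 2.0, 4.0, 1.0]  # one respondent's ratings (made up)

c = best_matching_unit(x, codebook)
print(c)  # 2
```

Ties are resolved in favour of the lowest index here, which is a convention of this sketch rather than part of the definition.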

During learning, those nodes that are topographically close in the array up 
to a certain distance will activate each other to learn from the same input. 
The construction of the clusters mapped is done by stochastic approximation 
considering in each iteration step several neighboring nodes in the lattice (cf. 
Bock (1997)). Useful values of the $m_i$ can be found as convergence limits of
the following learning process, whereby the initial values $m_i(0)$ can
be randomly chosen ($t$ indicates the progress step in the course of learning,
cf. Kohonen (1997)):

$$m_i(t+1) = m_i(t) + h_{ci}(t)\,[x(t) - m_i(t)].$$

In this relationship $h_{ci}(t)$ denotes the so-called neighborhood function, a
smoothing kernel defined over the lattice points. Usually $h_{ci}(t) =
h(\| r_c - r_i \|, t)$, where $r_c \in \Re^2$ and $r_i \in \Re^2$ are the radius
vectors of nodes $c$ and $i$, respectively, in the array. With increasing
$\| r_c - r_i \|$, $h_{ci} \to 0$. The average width and form of $h_{ci}$ define
the stiffness of the elastic surface to be fitted to the data points. Widely
applied neighborhood kernels are the bubble neighborhood and the Gaussian
neighborhood (for more details cf. Kohonen (1997)).

[Map with nine cluster regions (Cluster 1 to Cluster 9) and labelled consumer
prototypes such as “package is bulky”, “package is handy” and “package is
degrading the environment”]

Figure 2: Trained SOM containing eight times 12 codebook vectors
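
One learning step with a Gaussian neighborhood can be sketched as follows. This is a minimal illustration under stated assumptions: the kernel is taken in the common form alpha * exp(-||r_c - r_i||^2 / (2 sigma^2)), with the learning rate `alpha` and the width `sigma` held fixed, whereas in practice both usually decrease over time; the 2x3 lattice, the initial codebook and all parameter values are hypothetical.

```python
import math


def gaussian_neighborhood(r_c, r_i, alpha, sigma):
    """h_ci: largest at the winning node, decaying with lattice distance."""
    d2 = sum((a - b) ** 2 for a, b in zip(r_c, r_i))
    return alpha * math.exp(-d2 / (2.0 * sigma ** 2))


def som_step(codebook, positions, x, alpha, sigma):
    """One update m_i(t+1) = m_i(t) + h_ci(t) [x(t) - m_i(t)] for all nodes."""
    winner = min(range(len(codebook)), key=lambda i: math.dist(x, codebook[i]))
    updated = []
    for m_i, r_i in zip(codebook, positions):
        h = gaussian_neighborhood(positions[winner], r_i, alpha, sigma)
        updated.append([m + h * (x_j - m) for m, x_j in zip(m_i, x)])
    return winner, updated


# Hypothetical 2x3 rectangular lattice with six-dimensional codebook vectors.
positions = [(row, col) for row in range(2) for col in range(3)]
codebook = [[float(i)] * 6 for i in range(6)]
x = [2.0] * 6

winner, updated = som_step(codebook, positions, x, alpha=0.5, sigma=1.0)
print(winner, [round(m[0], 3) for m in updated])
```

Every node is pulled towards the input, but nodes far from the winner on the lattice move only slightly, which is what produces the topological ordering of the map.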

Figure 2 presents a typical solution for an 8x12 map: the relative distances
between neighboring codebook vectors (i = 96) are represented by shades on
a gray scale. If the average distance between neighboring codebook vectors is
small, a light shade is used; conversely, dark shades represent large
distances. This cluster landscape is used to form nine different clusters by
visual inspection. To characterize these clusters, the positions of six
consumer prototypes are projected onto the adjusted map by performing a recall
run (cf. figure 2).
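
The gray-scale landscape is based on average distances between neighboring codebook vectors, which can be computed as in the following sketch. It assumes a rectangular lattice stored row by row and a neighborhood of the four directly adjacent nodes; the function name `u_matrix` and the toy 2x2 map are assumptions made for illustration.

```python
import math


def u_matrix(codebook, n_rows, n_cols):
    """Average Euclidean distance from each node to its lattice neighbours."""
    def vec(r, c):
        return codebook[r * n_cols + c]

    result = []
    for r in range(n_rows):
        row = []
        for c in range(n_cols):
            neighbours = [
                (r + dr, c + dc)
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                if 0 <= r + dr < n_rows and 0 <= c + dc < n_cols
            ]
            total = sum(math.dist(vec(r, c), vec(*nb)) for nb in neighbours)
            row.append(total / len(neighbours))
        result.append(row)
    return result


# Toy 2x2 map with one-dimensional codebook vectors (made-up values).
toy_codebook = [[0.0], [0.0], [0.0], [3.0]]
landscape = u_matrix(toy_codebook, n_rows=2, n_cols=2)
print(landscape)  # [[0.0, 1.5], [1.5, 3.0]]
```

Small values would be drawn in light shades and large values in dark shades, so the isolated node in the lower right corner of the toy map would appear as a dark border region.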

                               WARD 3
  SOM 3       SOM 9        cluster 1   cluster 2   cluster 3   Total
  cluster 1'  cluster 9        14          -           -         14
              cluster 5         7          -           -          7
  cluster 2'  cluster 4         1         11           2         14
              cluster 7         -          3           -          3
              cluster 8         -          4           -          4
  cluster 3'  cluster 1         -          -          20         20
              cluster 2         -          -           3          3
              cluster 3         1          1           2          4
              cluster 6         -          -           6          6
  Total                        23         19          33         75

  question          handy   ecologically   decent   protects   appealing   easily
                            beneficial              well                   available
  overall sample    3.18    1.04           3.31     2.79       3.08        1.04
  Ward
    cluster 1       1.03    0.87           3.54     1.14       1.74        0.87
    cluster 2       1.99    1.02           4.30     5.61       3.27        1.10
    cluster 3       5.35    1.17           2.59     2.31       3.91        1.13
  8x12 SOM
    cluster 1'      0.94    0.54           5.12     0.53       1.62        0.72
    cluster 2'      2.34    1.05           2.95     3.83       2.87        1.15
    cluster 3'      5.93    1.32           2.82     2.43       05          1.07

Table 2: Cluster affiliation and centroids



As a traditional classification method we use Ward's error sum-of-squares
method. The elbow criterion indicated reasonable results for an aggregation
to five and to three groups. In the first column of the upper part
of table 2 the nine group solution represented in figure 2 is also condensed
to a three group solution, indicated by a stroke (cf. table 2). The process
of topological ordering mapped consumers with similar perceptions closely
together (package is bulky v. package protects the product well), whereas
consumers having different perceptions are projected onto distant codebook
vectors (bulky package v. handy package). It can be seen from the lower
part of this table that SOM clusters are similar to clusters formed by Ward's
method with respect to cluster affiliation and to centroids.

2.4 Classification Results 

The MLP used in our experiments consists of three layers with a varying
number of units in the hidden layer. An MLP 6-2-3 experiment, e.g., uses six
neurons plus one bias node in the input layer, two neurons plus one bias node
in the hidden layer and three neurons in the output layer. The initial solution
for the networks is found by randomly initialising the weights and improving
them by a genetic algorithm approach. Learning of the networks is completed
for several starting solutions using gradient descent learning and simulated
annealing to prevent the network from converging to a local minimum of the
error function. The performance of MLP and DA for all experiments is
documented in table 3. It indicates for all subsamples considered the number of
              apriori 2 groups              SOM 2 groups
subsample     DA    MLP     MLP             DA    MLP     MLP
                    6-1-2   6-2-2                 6-1-2   6-2-2
A              5     8       7              10    11      11
B             10    11       9              13    13      13
C             12    11      10              15    13      13
D             13    11      12              12    11      11
E             11    11      12              13    12      12
sum           51    52      50              63    60      60
hit rate      68%   69%     -               84%   80%     -
C_max         77%                           57%

              apriori 3 groups              SOM 3 groups                  Ward 3 groups
subsample     DA    MLP    MLP    MLP       DA    MLP    MLP    MLP       DA    MLP    MLP    MLP
                    6-1-3  6-2-3  6-3-3           6-1-3  6-2-3  6-3-3           6-1-3  6-2-3  6-3-3
A              4     7      6      6        13     6     13     12        15     7     13     12
B              4    10      7      6        14    12     12     11        13    11     14     11
C              8     9      8      8        14    13     10     13        15     7     14     14
D             10     9      8      6        13     8      7      7        14    12     12     12
E              6     8      7      8        14     8      8     12        15     8      8      8
sum           32    43     36     34        68    47     50     55        72    45     61     57
hit rate      43%   57%    -      -         91%   -      -      73%       96%   -      81%    -
C_max         61%                           44%                           44%

              SOM 5 groups                                Ward 5 groups
subsample     DA    MLP    MLP    MLP    MLP    MLP       DA    MLP    MLP    MLP    MLP    MLP
                    6-1-5  6-2-5  6-3-5  6-4-5  6-5-5           6-1-5  6-2-5  6-3-5  6-4-5  6-5-5
A             15     7      7      7      7     11        10     1      4      5      5      5
B             14     2      4      2      3      6        13     2      8      4      6      4
C             14     2      5      4      8      4        15     3      3      3      6      5
D             13     3      3      1      3      3        14     2      1      4      1      1
E             13     3      5      4      5      4        14     2      8      6      8      8
sum           69    17     24     18     26     28        66    10     24     22     26     23
hit rate      92%   -      -      -      -      37%       88%   -      -      -      35%    -
C_max         31%                                         35%

Table 3: Classification results in holdout samples



consumers correctly classified in the respective validation set. The sum of
hits across subsamples, on which the average rate of performance is based, is
also presented in this table for all experiments. The average hit rate attained
is added for every DA and for the best performing ANN. Additionally,
table 3 contains for every cluster method/cluster number combination the
maximum chance criterion $C_{max}$ as a common (overly pessimistic) benchmark
(Morrison (1969)). By comparing the rate of correct classifications
achieved by DA and the best performing MLP to $C_{max}$, it can be seen that
both modeling approaches are effective for the same experiments and vice
versa. A superior performance of ANN can only be observed for experiments
based on the three groups apriori clustering solution; a slightly better
performance of ANN is attained for the two groups apriori clustering solution.
For all remaining experiments the ANN results are systematically worse.
Therefore MLP seem to be sensitive to specific group assignments and to react
very flexibly to asymmetrical distributions in cluster membership, whereas
DA can obviously cancel out these influences by a more robust behaviour.
Furthermore, all experiments confirm the dominant role of the number of neurons
in the hidden layer for the performance of the networks.
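The comparison of hit rates with the maximum chance criterion can be retraced with a short calculation. The sketch below assumes that the hit rate is the sum of hits divided by the 75 validation cases implied by the totals in tables 2 and 3, and that the maximum chance criterion is the share of the largest group; the function names are hypothetical.

```python
def hit_rate(hits, n_cases):
    """Share of correctly classified cases."""
    return hits / n_cases

def max_chance_criterion(group_sizes):
    """Share of the largest group: the hit rate obtained by assigning
    every case to the largest group."""
    return max(group_sizes) / sum(group_sizes)

# Ward three-group sizes as reported in the "Total" row of table 2.
ward3_sizes = [23, 19, 33]
c_max = max_chance_criterion(ward3_sizes)

# Sums of hits for the Ward three-group experiments as reported in table 3.
da_rate = hit_rate(72, 75)
mlp_rate = hit_rate(61, 75)

print(f"C_max = {c_max:.0%}, DA = {da_rate:.0%}, best MLP = {mlp_rate:.0%}")
```

Both classifiers clear the benchmark in this experiment, with DA ahead of the best MLP.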
3 Conclusion

This contribution compared the use of ANN in market analysis vis-a-vis
traditional approaches. Our findings can be summed up as follows:

1. The most commonly used ANN in marketing are three-layered MLP
using backpropagation as training method. ANN are frequently applied to
data sets containing a comparatively small number of observations.

2. SOM can be used for clustering problems and lead to results comparable
to those found by traditional methods. This can be explained to some extent by
the similarity of SOM clustering results to k-means clustering results when
the radius of the neighborhood kernel is set to zero. An advantage of SOM
is its intuitive representation of the clustering solution.

3. Based on the presented results a general superiority of MLP classifiers
compared to traditional methods cannot be supported. There is some
evidence that relevant factors in this respect are qualitative in character (e.g.,
problem domain of data, origin of group assignments, level of aggregation).

4. Additional research may concentrate on the proper incorporation of
apriori probabilities into the MLP modeling approach to consider asymmetric
distributions in cluster membership. Furthermore, a wide range of
statistical tests in the area of traditional models helps to check necessary
conditions for an optimal procedure (e.g., presence of equal variance-covariance
matrices in each of the investigated groups for DA) and to assess the
quality of the solution (e.g., test of significance of discriminant functions
by an evaluation of Wilks' lambda). The apparent lack thereof in the context
of connectionist models should stimulate further attempts to establish
a commonly accepted statistical theory for ANN.
Abstract: Pricing strategies in marketing suffer from the problem that it is
difficult to model interdependencies with respect to price decisions of competing
enterprises. We present an approach which tries to tackle these shortcomings,
allows for additional insights into the pricing structure of a market, enables a
classification of different types of competitive pricing schemes and can be
incorporated into a profit optimization framework.



1 Introduction and Selected Concepts of 
Representing Price Competition 

An evaluation of price decisions has to take into account aspects of consumer
behaviour and objectives of the enterprise under consideration as well as
effects of competition (e.g., Gijsbrechts (1993)). With respect to the price
dependent aspects of consumer behaviour, e.g., the incorporation of latitudes
of price acceptance intervals (e.g., Kalyanaram, Little (1994); Urbany et
al. (1988)), cherry picking (e.g., Mulhern, Padgett (1995)) or multi-period
patterns of purchase (e.g., Krishna (1994)) should be mentioned. Concerning
the different objectives of the enterprise under consideration it is often
recommended to extend pure profit maximization by additional goals, e.g.,
via constraints like lower bounds for market shares or rates of return (e.g.,
Sivakumar (1995)). The incorporation of competitor oriented effects aims
at a more realistic description of the profitability of managerial decisions
(Armstrong, Collopy (1996)). The necessity to consider competitive
reactions in pricing strategies is undisputed (e.g., Mulhern (1997)) and a
broad variety of different approaches has been suggested.

Subsequently some selected concepts are reviewed in more detail. 

Reaction functions have been used for a long time (e.g., Dolan
(1981), Leeflang, Wittink (1996)) and can be described as something "which
determines for a firm in a given time period its action (price and/or quality)
as a function of the actions of (all) other firms during the preceding time
period" (Hanssens et al. (1990), p. 202). However, a deterministic
relationship between the marketing mix variables (especially price) of competing
enterprises may fail in practical applications (see, e.g., Natter, Hruschka
(1996) for corresponding experiences with respect to reaction functions).

Reaction matrices represent percentage changes of own marketing mix
instruments and those of competitors (e.g., Hanssens et al. (1990)) and are
useful in differentiating competitive reactions. However, the use of
elasticities imposes restrictions, e.g., the reliance on single reaction values
for different outcomes of changes in marketing mix variables.

Game theory models have been under consideration (see, e.g., Rao (1993)),
but as early as the beginning of the eighties the opinion was held that such
approaches are not directly useful for representing competitive behaviour in
marketing mix optimization models (e.g., Dolan (1981)).

A recent suggestion is to model competitive price settings by empirical
distribution functions of prices derived from given data via sampling within
the framework of independent and identically distributed random variables
(Natter, Hruschka (1996)). This makes it possible to take into account
competitive pricing close to the observed data but assumes independent
settings of prices over time.

In the following, we analyze competitive pricing structures of markets, pro- 
pose a classification for different types of competitive reactions, and use the 
results of an empirical example for further clarification. 



2 A Parsimonious Description of 
Competitive Pricing Reactions 

In order to tackle situations in which competitive pricing reactions occur,
we structure the problem into four steps and use the following notation:

$\kappa \in K$      index/set of competing brands
$b$                 brand under consideration
$l \in S_b$         index/set of price settings for brand $b$ (e.g., $S_b$ =
                    {down, neglectable, up})
$s \in S$           index/set of price tiers for competing brands (e.g., $S$ =
                    {low, medium, high})
$t \in T$           index/set of time points for which pricing data have been
                    collected, e.g., weekly data
$Q^{\kappa,l} = (q_{sj}^{\kappa,l})$   observed matrix of transition probabilities between price
                    tiers $s, j \in S$ for brand $\kappa$ when the price setting for
                    brand $b$ is described by index $l$
$R^{\kappa,l} = (r_{sj}^{\kappa,l})$   matrix of transition probabilities calculated by the
                    constancy/change model
$(r_s^{\kappa,l})'$ row $s$ of $R^{\kappa,l}$

For reasons of parsimony we choose $S_b$ = {down, neglectable, up} and $S$ =
{low, medium, high} and will present an empirical example to explain our
findings later on.

In a first step, prices $p_{b,t-1}$ and $p_{b,t}$ for brand $b$ at subsequent
time points $t-1$ and $t$ are used to derive the sets

$$M_b^{down} = \{\, t \mid p_{b,t} < p_{b,t-1} - \delta^- \,\}, \qquad (1)$$

$$M_b^{neglectable} = \{\, t \mid p_{b,t} \in [\, p_{b,t-1} - \delta^-,\; p_{b,t-1} + \delta^+ \,] \,\}, \qquad (2)$$

$$M_b^{up} = \{\, t \mid p_{b,t} > p_{b,t-1} + \delta^+ \,\}, \qquad (3)$$

where $\delta^-, \delta^+ \ge 0$ describe bounds for a price range inside which
it is assumed that price alterations for brand $b$ are neglectable for the
consideration of competitive pricing reactions (see, e.g., Mulhern, Leone (1995)
for a similar argumentation).
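The first step can be sketched as a small routine that assigns every time point to one of the three price settings of brand b. This is an illustrative sketch only; the function name, the argument names `delta_minus` and `delta_plus`, and the price series are hypothetical.

```python
def price_settings(prices, delta_minus=0.0, delta_plus=0.0):
    """Partition time points t = 1, ..., len(prices) - 1 into the price
    settings 'down', 'neglectable' and 'up' of the brand under consideration."""
    sets = {"down": [], "neglectable": [], "up": []}
    for t in range(1, len(prices)):
        previous, current = prices[t - 1], prices[t]
        if current < previous - delta_minus:
            sets["down"].append(t)
        elif current > previous + delta_plus:
            sets["up"].append(t)
        else:
            # Price change inside the tolerance band around the previous price.
            sets["neglectable"].append(t)
    return sets

# Hypothetical weekly prices of brand b.
prices_b = [1.00, 0.90, 0.92, 1.10]
sets_b = price_settings(prices_b, delta_minus=0.05, delta_plus=0.05)
```

With both bounds set to zero, only exactly unchanged prices fall into the neglectable set.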

In a second step, these sets $M_b^{down}$, $M_b^{neglectable}$ and $M_b^{up}$,
which differentiate between situations of price reductions ("down"), price
increases ("up") and neglectable price changes ("neglectable") of brand $b$,
are used to investigate the price movements of the competing brands
$\kappa \in K$. Corresponding transition probability matrices $Q^{\kappa,l}$
can be empirically derived from the given data. In order to realize a
parsimonious representation of the competitive pricing behaviour, again, the
price movements of the competing brands are modelled on a more aggregated level
taking into account only changes within the three price tiers low, medium, and
high. These price tiers are calculated by partitioning the observed price
ranges into three intervals of approximately the same length.
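The second step can be sketched as follows: the prices of a competing brand are mapped to three tiers of equal length, and transitions between tiers are counted over the time points belonging to one price setting of brand b. The function names and the data are hypothetical, and rows without observations are left at zero, which is an assumption of this sketch.

```python
import numpy as np

def price_tiers(prices, n_tiers=3):
    """Map prices to tiers 0 (low), 1 (medium), 2 (high) by splitting the
    observed price range into intervals of equal length."""
    prices = np.asarray(prices, dtype=float)
    edges = np.linspace(prices.min(), prices.max(), n_tiers + 1)
    return np.digitize(prices, edges[1:-1])

def transition_matrix(tiers, time_points, n_tiers=3):
    """Relative frequencies of tier transitions (t-1 -> t) over the given time points."""
    counts = np.zeros((n_tiers, n_tiers))
    for t in time_points:
        counts[tiers[t - 1], tiers[t]] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows without any observation stay zero instead of dividing by zero.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

# Hypothetical weekly prices of a competing brand.
prices_k = [1.0, 1.0, 2.0, 3.0, 3.0, 1.0]
tiers_k = price_tiers(prices_k)
# Hypothetical set of time points, e.g. all weeks in which brand b cut its price.
Q = transition_matrix(tiers_k, time_points=[1, 2, 3, 4, 5])
```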

In a third step, we try to explain the empirically derived transition
probability matrices within a framework which models price settings as a mixture
of two opposite types of pricing behaviour: a tendency to remain in the
last chosen price tier ("constancy") and a tendency to vary price tiers over
time ("change"). Such a differentiation is close to considerations concerning
everyday low pricing and high/low pricing as discussed in the most recent
literature (e.g. Kopalle, Winer (1996); Lal, Rao (1997)).

Finally, in a fourth step, the empirically derived matrices are analyzed based
on what we have called the constancy/change model. The results are used
for a classification of the competitive pricing strategies under consideration
according to some prespecified classes.



3 Classification Possibilities via the 
Constancy/Change Model 

The idea of the constancy/change model is to describe competitive pricing
behaviour via a convex combination of distribution functions where the
"constancy" situation (i.e. a competing brand $\kappa$ stays in state $s \in S$
independent of the behaviour of brand $b$) is represented by

$$(\delta_s)' = (\delta_{s1}, \ldots, \delta_{sj}, \ldots) \quad \text{with} \quad \delta_{sj} = \begin{cases} 1, & j = s \\ 0, & \text{otherwise} \end{cases}, \quad \forall s \in S, \qquad (4)$$

and the "change" situation by

$$(\pi^{\kappa,l})' = (\pi_1^{\kappa,l}, \ldots, \pi_j^{\kappa,l}, \ldots) \quad \text{with} \quad \pi_j^{\kappa,l} \ge 0, \quad \sum_j \pi_j^{\kappa,l} = 1. \qquad (5)$$

Here, the competing brand $\kappa$ reacts to the price setting strategy of brand
$b$ (denoted by $l$) according to the distribution $\pi^{\kappa,l}$. Remember
that for every row state of a transition probability matrix the corresponding
row describes a probability distribution on the column states of that matrix.
In the constancy/change model we use the expressions

$$(r_s^{\kappa,l})' = (1 - \alpha_s^{\kappa,l})\,(\pi^{\kappa,l})' + \alpha_s^{\kappa,l}\,(\delta_s)', \quad \text{with} \quad \alpha_s^{\kappa,l} \in [0, 1], \quad \forall s \in S, \qquad (6)$$

to construct a matrix $R^{\kappa,l}$ that best fits the empirically derived
matrix $Q^{\kappa,l}$. $\alpha^{\kappa,l} = (\alpha_1^{\kappa,l}, \ldots, \alpha_s^{\kappa,l}, \ldots)$
describes the degree of the mixture of the just explained probability
distributions. Dependent on the row state $s$ the value of $\alpha_s^{\kappa,l}$
helps to interpret the tendency to vary prices over time or to remain in the
last chosen price tier.

Formulation (6) includes interesting special cases. $\alpha_s^{\kappa,l} = 0$ leads to

$$(r_s^{\kappa,l})' = (\pi^{\kappa,l})'. \qquad (7)$$

A situation in which formulation (7) is valid for all $s \in S$ is equivalent
to the specification by Natter, Hruschka (1996) if $(\pi^{\kappa,l})'$ is
derived according to the empirical distribution of prices.
$\alpha_s^{\kappa,l} = 1$ for all $s \in S$ leads to the identity matrix and
reflects the pricing behaviour of a competitor which stays in the last chosen
price tier.
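The mixture in formulation (6) and its two special cases can be illustrated numerically. The sketch below is not the authors' estimation procedure; it only builds the model matrix row by row as a convex combination of a "change" distribution and the row of the identity matrix. The function name and all parameter values are hypothetical.

```python
import numpy as np

def constancy_change_matrix(alpha, pi):
    """Row s is (1 - alpha_s) * pi + alpha_s * e_s, with e_s the s-th unit vector."""
    alpha = np.asarray(alpha, dtype=float)
    pi = np.asarray(pi, dtype=float)
    n = len(pi)
    return (1.0 - alpha)[:, None] * pi[None, :] + alpha[:, None] * np.eye(n)

pi = [0.2, 0.5, 0.3]       # hypothetical "change" distribution over low/medium/high
alpha = [0.9, 0.0, 1.0]    # hypothetical mixture weights per row state

R = constancy_change_matrix(alpha, pi)
R_change = constancy_change_matrix([0, 0, 0], pi)      # every row equals pi
R_constancy = constancy_change_matrix([1, 1, 1], pi)   # identity matrix
```

Every row of `R` remains a probability distribution because it is a convex combination of two distributions.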

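The constancy/change model allows for a classification of competitive pricing
strategies as depicted in Tables 1 and 2 where $\epsilon_s^{\kappa,l} > 0$ are
suitably chosen threshold parameters.

$\alpha_s^{\kappa,l} \le \epsilon_s^{\kappa,l}$    $\epsilon_s^{\kappa,l} < \alpha_s^{\kappa,l} < 1 - \epsilon_s^{\kappa,l}$    $\alpha_s^{\kappa,l} \ge 1 - \epsilon_s^{\kappa,l}$
"change"                                           "constancy/change"                                                           "constancy"
label for the competitive pricing situation of $\kappa$, dependent on the indices $l$ and $s$

Table 1: Classification of pricing strategies

$\pi^{\kappa,down} \le_{st} \pi^{\kappa,neglectable} \le_{st} \pi^{\kappa,up}$    "imitation"
$\pi^{\kappa,down} \ge_{st} \pi^{\kappa,neglectable} \ge_{st} \pi^{\kappa,up}$    "contra-reaction"

Table 2: Subclasses of pricing strategies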



The two parameter vectors $\alpha^{\kappa,l}$ and $\pi^{\kappa,l}$ allow for
insights concerning interactions of price settings. $\alpha_s^{\kappa,l}$ close
to one determines the tendency to remain in the price tier of the last period
("constancy") while $\alpha_s^{\kappa,l}$ close to zero indicates variation
with respect to the prespecified price tiers independent of prior price
settings for brand $b$ ("change"). Additional specifications of competitive
pricing strategies are possible by comparing $\pi^{\kappa,l}$,
$l \in$ {down, neglectable, up}. In the case that the "constancy/change"
situation holds for all $l$, a stochastic ordering according to
$\pi^{\kappa,down} \le_{st} \pi^{\kappa,neglectable} \le_{st} \pi^{\kappa,up}$
indicates switches within the price tiers of brand $\kappa$ which are in
accordance with the price movements of brand $b$: the tendency to increase
prices (i.e. to move from "down" to "neglectable" to "up") for brand $b$ is
accompanied by a "similar behaviour" of brand $\kappa$, given by the
probability distributions $\pi^{\kappa,down}$, $\pi^{\kappa,neglectable}$,
$\pi^{\kappa,up}$ with more probability mass on higher price tiers. This
tendency of brand $\kappa$ may be named "imitation" within the
"constancy/change" situation. The opposite behaviour could be labeled
"contra-reaction". Of course, there is a broad variety of possibilities to
include additional stochastic relations with respect to $\pi^{\kappa,down}$,
$\pi^{\kappa,neglectable}$ and $\pi^{\kappa,up}$ in the classification scheme
of Table 2.
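The classification logic can be sketched in a few lines. The sketch reads the stochastic order as the usual first-order stochastic dominance on the ordered tiers low, medium, high, which is an assumption about the intended definition; the function names, the single threshold `eps`, the treatment of the boundary cases and all numbers are likewise hypothetical.

```python
import numpy as np

def st_leq(p, q, tol=1e-12):
    """True if p is stochastically smaller than or equal to q on ordered tiers,
    i.e. the cumulative distribution of p is nowhere below that of q."""
    return bool(np.all(np.cumsum(p) >= np.cumsum(q) - tol))

def subclass(pi_down, pi_neglectable, pi_up):
    """Subclass within the 'constancy/change' situation."""
    if st_leq(pi_down, pi_neglectable) and st_leq(pi_neglectable, pi_up):
        return "imitation"
    if st_leq(pi_up, pi_neglectable) and st_leq(pi_neglectable, pi_down):
        return "contra-reaction"
    return "unclassified"

def label(alpha_s, eps):
    """Label of one row state; eps is a hypothetical threshold parameter."""
    if alpha_s <= eps:
        return "change"
    if alpha_s >= 1.0 - eps:
        return "constancy"
    return "constancy/change"

# Hypothetical distributions over (low, medium, high).
low_heavy, balanced, high_heavy = [0.6, 0.3, 0.1], [0.3, 0.4, 0.3], [0.1, 0.3, 0.6]
print(subclass(low_heavy, balanced, high_heavy))   # imitation
print(subclass(high_heavy, balanced, low_heavy))   # contra-reaction
```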



4 Empirical Example 

We illustrate our approach based on data of four brands, i.e.,
$K' = K \cup \{b\} = \{I, II, III, IV\}$, from the area of frequently purchased
consumer packaged goods.

Figure 1: Competitive positioning of brands I, II, III, and IV (axes: clout and vulnerability).

The brands account on average for approximately 61% market share during
the observation period of 101 weeks, i.e., $T = \{1, \ldots, 101\}$. Only
market shares $ms_i(t)$ and corresponding brand prices $p_i(t)$ were available
on a weekly basis, $i \in K'$, $t \in T$. In the observation period different
types of pricing behaviour (leader/follower situations,
cooperative/non-cooperative price settings) occurred.
The competitive structure of the market can be illustrated using the clout
and vulnerability values of the different brands as shown in Figure 1, where
clout $C_i := \sum_{j \in K' \setminus \{i\}} \ldots$ and vulnerability
$V_i := \ldots$ of brand $i$ can be computed from the average cross price
elasticities $e_{ij}$, with cross price elasticities $e_{ij}(t)$ calculated on
a weekly basis.

Brand IV has an interesting positioning: the clout of brand IV is very low and
indicates only marginal opportunities to affect competitive market shares
by own price settings. On the other hand, the vulnerability of brand IV is
relatively high, which underlines that the market share of this brand may be
seriously influenced by competitive pricing decisions. Therefore,
the following presentation of our classification approach will be illustrated
with focus on brand IV, i.e., $b = IV$ and $K = \{I, II, III\}$.

Due to space limitations for this paper the application of the
constancy/change model is discussed only for brand $\kappa = III$ which, too,
has an interesting positioning in the competitive environment.



5 Results and Classification of Pricing 
Strategies 



Although the above formulated approach allows for arbitrary non-negative
$\delta^-, \delta^+$, in this paper, for convenience, we have chosen
$\delta^- = \delta^+ = 0$. After having calculated the sets $M_{IV}^{down}$,
$M_{IV}^{neglectable}$ and $M_{IV}^{up}$ in the first step, the transition
probability matrices $Q^{\kappa,l}$, $l \in S_{IV}$, and the corresponding
matrices $R^{\kappa,l}$, $l \in S_{IV}$, are determined in the second and
third step.

Of course, one cannot expect that a perfect fit between Q- and R-matrices
can be obtained. However, this happened for $Q^{III,down} = R^{III,down}$. In
addition, the results with respect to brand III show that the empirical
matrices $Q^{III,l}$ can be represented quite well by the analytical matrices
$R^{III,l}$ (see Table 3 for details). Corresponding calculations are available
for brands I and II but not presented.

The parameter vectors $\alpha^{III,l}$ and $\pi^{III,l}$, $l \in S_{IV}$, are
estimated by a maximum likelihood approach. One gets the orders

$$\alpha_1^{III,down} > \alpha_1^{III,neglectable} > \alpha_1^{III,up}$$
$$\alpha_2^{III,down} > \alpha_2^{III,neglectable} > \alpha_2^{III,up} \qquad (8)$$
$$\alpha_3^{III,neglectable} > \alpha_3^{III,up}$$

and the stochastic order

$$\pi^{III,down} \ge_{st} \pi^{III,neglectable} \ge_{st} \pi^{III,up}. \qquad (9)$$

According to the estimation results for $\alpha^{III,l}$, $l \in S_{IV}$, given
in Table 3, the pricing strategy of brand III as reaction to brand $b = IV$
belongs to the class of "constancy/change" (see Table 1).
$$Q^{III,down} = \begin{pmatrix} 1.000 & 0.000 & 0.000 \\ 0.000 & 0.889 & 0.111 \\ 0.000 & 0.056 & 0.944 \end{pmatrix}, \qquad Q^{III,neglectable} = \begin{pmatrix} 0.917 & 0.083 & 0.000 \\ 0.000 & 0.667 & 0.333 \\ 0.000 & 0.000 & 1.000 \end{pmatrix}$$

$$R^{III,down} = \begin{pmatrix} 1.000 & 0.000 & 0.000 \\ 0.000 & 0.889 & 0.111 \\ 0.000 & 0.056 & 0.944 \end{pmatrix}, \qquad R^{III,neglectable} = \begin{pmatrix} 0.917 & 0.027 & 0.056 \\ 0.000 & 0.708 & 0.292 \\ 0.000 & 0.000 & 1.000 \end{pmatrix}$$

$$Q^{III,up} = \begin{pmatrix} 0.909 & 0.091 & 0.000 \\ 0.091 & 0.818 & 0.091 \\ 0.000 & 0.080 & 0.920 \end{pmatrix}, \qquad R^{III,up} = \begin{pmatrix} 0.909 & 0.082 & 0.008 \\ 0.076 & 0.841 & 0.083 \\ 0.007 & 0.073 & 0.920 \end{pmatrix}$$

$$\alpha^{III,up} = \begin{pmatrix} 0.902 \\ 0.000 \\ 0.913 \end{pmatrix}, \qquad \pi^{III,up} = \begin{pmatrix} 0.076 \\ 0.841 \\ 0.083 \end{pmatrix}$$

Table 3: Results of the constancy/change model
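As a plausibility check, the analytical matrix for the price setting "up" can be recomputed from the two vectors printed in Table 3. The sketch reads the vector (0.902, 0.000, 0.913) as the mixture weights and (0.076, 0.841, 0.083) as the "change" distribution, which is an interpretation of the table layout; the variable names are hypothetical.

```python
import numpy as np

alpha_up = np.array([0.902, 0.000, 0.913])   # read as mixture weights per row state
pi_up = np.array([0.076, 0.841, 0.083])      # read as "change" distribution

# Row s: (1 - alpha_s) * pi + alpha_s * (s-th unit vector), cf. formulation (6).
R_up = (1.0 - alpha_up)[:, None] * pi_up[None, :] + np.diag(alpha_up)

R_up_printed = np.array([[0.909, 0.082, 0.008],
                         [0.076, 0.841, 0.083],
                         [0.007, 0.073, 0.920]])
Q_up_printed = np.array([[0.909, 0.091, 0.000],
                         [0.091, 0.818, 0.091],
                         [0.000, 0.080, 0.920]])

max_model_gap = np.abs(R_up - R_up_printed).max()   # rounding differences only
max_fit_gap = np.abs(R_up - Q_up_printed).max()     # deviation from the empirical matrix
```

Under this reading the recomputed matrix agrees with the printed analytical matrix up to rounding, and its largest deviation from the empirical matrix stays below 0.03.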



Inequalities (8) indicate that the "constancy" proportion of the competitive
pricing strategy of brand III is decreasing when brand IV moves to "higher"
price settings. The stochastic order of $\pi^{III,l}$, $l \in S_{IV}$, as given
in equation (9), allows for a more refined classification (see Table 2).
Brand III obeys a pricing strategy of "contra-reaction" within the
"constancy/change" class. This behaviour can be detected quite well from
Figure 2, in which 26 weeks of the empirical price paths of brands III and IV
from the middle of the observation period are depicted. Simple linear
regression results proved to be not very useful: either the goodness of fit
was unsatisfactory or the estimated parameters were not statistically
significant, although the regression results indicated a tendency towards the
interpretation which the classification approach of this paper has very
clearly detected.




Figure 2: Price paths of brands III and IV (weeks 42 to 67). 
6 Conclusion and Outlook

The presented contribution suggests a framework for analyzing and modelling
competitive pricing behaviour. The incorporation of pricing interactions
via an a priori classification of the price settings of the brand under
consideration and the price tiers of the competing brands is of key interest.
The decomposition of the pricing information within the presented
constancy/change model allows for additional insights into the general pricing
behaviour of competitors and enables a classification of different types
of competitive pricing strategies.

The approach allows an anticipation of competitive pricing reactions within
a well-founded statistical framework. Extensive Monte Carlo simulations for
different planning horizons resulted in additional profit gains in the long
run when taking into account competitive pricing reactions in the explained
manner.

Of course, there are questions left for further research, e.g., the
incorporation of multi-product aspects or the consideration of product
categories.
Abstract: The "Self-Organizing (Feature) Map" methodology as proposed by
Kohonen (1982) is employed in the context of simultaneous competitive market
structure and segmentation analysis. In a demonstration study using brand
preferences derived from household panel data, the adaptive algorithm results
in a mapping of topologically ordered prototypes of brand choice patterns at the
segment level. Furthermore, validity aspects are discussed and the results are
compared with those derived from a more traditional method.



1 Introduction 

Competitive Market Structure (CMS) analysis refers to the task of deriving
a configuration of brands in a product class on the basis of their competitive
relationships. In marketing literature, it is widely accepted to operationalize
the degree of inter-brand competition as a measure of substitution as
perceived by consumers (cf. DeSarbo et al. (1993)). However, once the data
analyst wishes to introduce heterogeneity across consumers (e.g., in terms
of preference and/or consideration set) into the model, CMS turns out to
be a segment specific concept, which raises the issue of deciding on the
appropriate level of data aggregation and makes CMS and market segmentation
analysis dependent on each other.

The paper proceeds as follows: First, a brief outline of contemporary
approaches to simultaneous CMS/segmentation analysis is provided. Next,
the adaptive Self-Organizing (Feature) Map (SOM) methodology according to
Kohonen (1982) is adapted to this task in a demonstration study
using household-level panel data. Finally, validity issues are discussed and
SOM results are compared with those emerging from a traditional approach.



2 Combined CMS and Segmentation Analysis

As Grover and Srinivasan (1987) point out, the utilization of brand choice
probabilities as segmentation basis turns CMS and market segmentation
into "reverse sides of the same analysis". In fact, the basic difference
between these two "sides" lies in the way an observed data matrix is
processed. According to data theoretical terminology, the shape of a data
matrix is determined by the number of ways (i.e. the dimensions) and the
number of modes (sets of different entities that represent the ways of the
matrix; e.g., consumers, brands or brand attributes). From this point of
view, the two concepts may be considered as (at least) formally identical data
reduction problems. As illustrated in table 1, the only (formal) difference
concerns the mode that is the focus of data reduction:

Conventional approaches to CMS analysis reduce the brands mode only.
They typically result in either a "non-spatial" arrangement of ultrametric
trees, overlapping or fuzzy cluster structures (for a review cf. DeSarbo
et al. (1993)) or a "spatial" representation of brand configurations in a
geometric space. Spatial models can be further subdivided into
"compositional" methods involving reduction of high-dimensional attribute
spaces via principal components, discriminant or correspondence analysis and
"decompositional" approaches. The latter are usually based on multidimensional
scaling (MDS) of respondents' proximity or dominance statements about
rival brands. Furthermore, certain unfolding techniques for preference data
embed the consumers mode as ideal vectors or points in a "joint space" of
consumer and brand configurations (for details see Baier (1994, pp. 32)).



Mode of                  Representation Type
Data Reduction           Discrete ("Non-Spatial")             Geometric ("Spatial")

Brands                   Hierarchical (Tree Models) /         Compositional/Decompositional
                         Non-Hierarchical Classification      Positioning Analysis

                         "Combined" Approaches (e.g. LCMDS models)

Subjects (Consumers)     A Posteriori                         Preference Scaling
                         Market Segmentation                  Models

Table 1: Synopsis of models for CMS/segmentation analysis



A posteriori market segmentation, on the other hand, refers to a compression
of the consumers mode, which is usually achieved via clustering or latent
class techniques (Wedel and Kamakura (1998) provide an up-to-date review).
In contrast, approaches for combined CMS/segmentation analysis reduce the
consumers and brands mode simultaneously in one single model. Proposals
in this direction are presented, e.g., by Hruschka (1986), Grover
and Srinivasan (1987) or Wedel and Steenkamp (1991). Another promising
stream of modelling efforts equipped with the option of introducing the
consumer mode into CMS analysis is represented by models for multi-mode
factor analysis. In an extensive demonstration study Klapper (1998)
illustrated that constrained versions of three-way factor analysis are also
able to uncover asymmetric competitive relationship patterns between rival
brands.

In a review of "Latent Class Multidimensional Scaling" (LCMDS) techniques,
DeSarbo et al. (1994) describe a special class of models for simultaneous
CMS/segmentation analysis (see also Böckenholt and Gaul (1989)).
To illustrate current LCMDS methodology consider the MULTICLUS model
of DeSarbo et al. (1991): MULTICLUS is designed to process profile or
dominance data and simultaneously performs MDS and cluster analysis for
a mixture of conditional multivariate normal distributions. Consumers'
heterogeneity is explicitly modelled via maximum likelihood estimates for
segment-specific ideal vectors in a joint space of brand coordinates.
Unfortunately, restrictive input data requirements and parametric model
assumptions cause problems not yet adequately solved by "classical" statistical
methodology. The following section introduces an alternative approach to
multimode data compression that addresses tasks similar to those of LCMDS-based
models, but imposes less rigorous assumptions on input data conditions.



3 Methodological Aspects of SOMs 

Similar to principal components analysis or MDS, an SOM network constructs
a low-dimensional mapping in order to detect the inherent structure
of high-dimensional input data in a visually easily inspectable manner.
Unlike the geometric configurations estimated by LCMDS models, the SOM
method arrives at a non-linear projection of input data space onto a discrete
map of topologically ordered units via an adaptive procedure (cf. Kohonen
(1982, p. 139)). A typical SOM network architecture consists of the following
components: an m-dimensional input layer representing features of
input vectors $x_k = [x_{k1}, x_{k2}, \ldots, x_{km}]$ out of a set of
$k = 1, \ldots, K$ training vectors, a usually two-dimensional competitive
layer organized as a grid of units $u_{ij}$, where $i$ represents the row
index and $j$ the column index of a unit position in the layer, and an
m-dimensional weights vector for each SOM unit $u_{ij}$:
$w_{ij} = [w_{ij1}, w_{ij2}, \ldots, w_{ijm}]$.

Since the SOM algorithm is well-documented in the relevant literature (cf.
Kohonen (1982, or 1995, pp. 78)), the following remarks are confined to some
specific features of the procedure applied in the forthcoming demonstration
study (for further details see Reutterer (1997)): While running the iterative
procedure, at each step $t = 1, \ldots, T$ for a randomly chosen input vector
$x_k$ a "winning" or "best matching" unit with associated activity
$w_{c_{ij}}$ is determined according to the rule:

$$\| x_k - w_{c_{ij}} \| = \min_{ij} \{ \| x_k - w_{ij} \| \}, \qquad (1)$$

where the index $c_{ij}$ denotes the location of the winner's position in the
two-dimensional layer. Updating of the weights vectors is performed according
to the following learning rule:

$$w_{ij}(t+1) = w_{ij}(t) + \alpha(t)\, h_{c_{ij}}(t)\, [x_k(t) - w_{ij}(t)], \qquad (2)$$
where $\alpha(t)$ denotes the learning rate and $h_{c_{ij}}(t)$ a neighborhood 
of unit $u_{c_{ij}}$ at time t. Unlike the discrete neighborhood function 
chosen in a previous work by Mazanec (1995), here a smoother kernel-based 
neighborhood update process is controlled by the Gaussian function 

$$h_{c_{ij}}(t) = \exp\left( -\frac{(i - c_i)^2 + (j - c_j)^2}{2\sigma(t)^2} \right) , \qquad (3)$$

where i (j) is the row (column) index of an SOM unit $u_{ij}$ and $(c_i, c_j)$ 
the corresponding indices of unit $u_{c_{ij}}$. Both the learning rate and the 
kernel width parameter $\sigma(t)$ in (3) are shrinking functions of time, i.e. 
$\alpha(t), \sigma(t) \to 0$. The SOM algorithm implemented by the author in 
MATLAB uses the functional form $\alpha(t) = \alpha(0)\,(c/(t + c))$ with the 
constant $c = T/100$ and $\alpha(0) = 0.1$. The maximum number of iterations T 
is set to 20 epochs (i.e. runs through the input data set). Notice that 
specification of the kernel width is crucial for the quality of topological 
ordering of cluster prototypes. Here, $\sigma(0)$ is set to the diameter of the 
map and decreases with a functional form identical to that of the learning rate 
but with a smoother decay rate of $c = T/10$. 
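
The training procedure described by rules (1) to (3) and the two decay schedules can be sketched in a few lines of Python. This is only an illustrative sketch, not the author's MATLAB implementation: the function names, the random initialization of the weights, the shuffled presentation order within each epoch, the counting of T in single update steps and the use of the grid diagonal as "diameter of the map" are all assumptions made here.

```python
import math
import random


def decay(initial, t, c):
    """Hyperbolic decay initial * c / (t + c), the functional form given in the text."""
    return initial * c / (t + c)


def best_matching_unit(weights, x):
    """Return (row, col) of the unit whose weights vector is closest to x, cf. rule (1)."""
    best, best_dist = None, float("inf")
    for i, row in enumerate(weights):
        for j, w in enumerate(row):
            dist = sum((xa - wa) ** 2 for xa, wa in zip(x, w))
            if dist < best_dist:
                best, best_dist = (i, j), dist
    return best


def train_som(data, rows, cols, epochs=20, alpha0=0.1, seed=0):
    """Train a rows x cols SOM with a Gaussian neighborhood, cf. rules (2) and (3)."""
    rng = random.Random(seed)
    m = len(data[0])
    # Assumption: weights start as uniform random numbers in [0, 1).
    weights = [[[rng.random() for _ in range(m)] for _ in range(cols)]
               for _ in range(rows)]
    total_steps = epochs * len(data)                 # T counted in single updates (assumption)
    sigma0 = math.hypot(rows - 1, cols - 1) or 1.0   # grid diagonal as map diameter (assumption)
    t = 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):        # random presentation order
            alpha = decay(alpha0, t, total_steps / 100)   # c = T/100
            sigma = decay(sigma0, t, total_steps / 10)    # c = T/10
            ci, cj = best_matching_unit(weights, x)
            for i in range(rows):
                for j in range(cols):
                    h = math.exp(-((i - ci) ** 2 + (j - cj) ** 2) / (2 * sigma ** 2))
                    w = weights[i][j]
                    for a in range(m):
                        w[a] += alpha * h * (x[a] - w[a])
            t += 1
    return weights
```

Because every update moves a weights vector a fraction of the way towards the presented input vector, prototypes trained on choice probabilities stay within the range of the data.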



4 Empirical Demonstration Study 

4.1 Description of Input Data 

Input data are derived from purchase histories of 781 non-brand-loyal panel 
households for nine major brands (six national brands and three nationally 
distributed private labels) accounting for 90 per cent of total market share 
in the margarine product class. Households’ “zero-order” brand choice 
probabilities are used as SOM training data. Thus, each household k is 
described by a nine-component vector $x_k$. Element $x_{km}$ measures the 
relative purchase frequency of brand m = 1, ..., 9 over a two-year period, 
where $\sum_m x_{km} = 1$. Note that consideration set heterogeneity across 
households (the average set contains about three brands) results in a large 
number of zero entries. 
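
As a small illustration of how such an input vector arises, the following Python sketch converts the purchase counts of one household into relative purchase frequencies. The function name and the counts are hypothetical and serve only to show the normalization.

```python
def choice_probabilities(purchase_counts):
    """Turn purchase counts per brand into relative purchase frequencies that sum to one."""
    total = sum(purchase_counts)
    if total == 0:
        raise ValueError("household has no purchases in the observation period")
    return [count / total for count in purchase_counts]


# Hypothetical household that bought three of the nine brands.
household_counts = [6, 0, 0, 3, 0, 0, 1, 0, 0]
x_k = choice_probabilities(household_counts)
```

The six zero entries in this example reflect a consideration set of three brands.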



4.2 Selected Properties of Final SOM Configurations 

After SOM training, a final classification is determined by identifying the 
“best matching” unit according to rule (1) during a “recall run” through the 
input data set. Analogous to other techniques of exploratory data analysis, 
the suitability of SOM formats is determined heuristically, e.g. by examining 
“goodness of fit” measures against a sequence of shrinking SOM layers. 

SOM layer   MSE      MSIPD    VAF      CVAF     A. Rand   A. Rand
                                                (total)   (split half)
5 x 5       0.1046   0.0599   0.7300   0.6105   0.7440    0.6702
4 x 5       0.0953   0.0880   0.7429   0.6377   0.7566    0.6598
4 x 4       0.1065   0.0990   0.7003   0.6071   0.7179    0.6294
3 x 4       0.1280   0.1320   0.6245   0.5696   0.7431    0.6247
3 x 3       0.1477   0.1468   0.5233   0.4861   0.7138    0.6157
2 x 3       0.1652   0.2604   0.4950   0.4748   0.7300    0.6074
2 x 2       0.1822   0.3696   0.4366   0.4355   0.9406    0.7797
1 x 2       0.2580   0.2910   0.2575   0.2569   0.9890    0.9277

Table 2: Heterogeneity, similarity, VAF and replication values of SOMs 

Two such measures are listed in table 2: As a heterogeneity measure, the 
“Mean Squared Errors” (MSE) inform about relative distances between data 
points and corresponding prototype vectors $w_{ij}$. Since more and more 
dissimilar preference patterns are grouped together, total heterogeneity of 
cluster solutions should increase with decreasing numbers of prototypes. The 
similarity measure summarizes deviations of final prototypes from the 
prespecified grid of units and is computed as “Mean Squared Inter-Prototype 
Distances” (MSIPD) for adjacent SOM units, i.e. distances between each 
prototype vector $w_{ij}$ and the prototype vectors of its immediately 
adjacent units. For comparison across different grid formats, the sum of 
distances is divided by the total number of possible neighborhood relations. 
The small similarity values in table 2 for layers with 16 and more units 
indicate “good” topological representations of input data. However, the 
topological quality of those maps is achieved at the cost of the centroid 
property of prototype vectors. Table 2 provides evidence for this trade-off 
relationship by comparing the “Variance-Accounted-For” (VAF) statistics of 
the partitions (within groups divided by total variance) with a “Corrected 
VAF” (CVAF) measure adjusted for deviations of prototypes from respective 
class means (which again is inversely related to MSE): The spread between 
these two measures of total data recovery increases with improved topological 
quality of the map. 
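
The heterogeneity and similarity measures can be made concrete with a short Python sketch. It rests on assumptions that go beyond the text: MSE is taken as the mean squared Euclidean distance between each data point and its best matching prototype, and "immediately adjacent" is read as horizontal and vertical grid neighbors only. The function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def mse(data, weights):
    """Mean squared distance between data points and their best matching prototypes."""
    total = 0.0
    for x in data:
        total += min(sq_dist(x, w) for row in weights for w in row)
    return total / len(data)


def msipd(weights):
    """Mean squared inter-prototype distance over all adjacent pairs of grid units."""
    rows, cols = len(weights), len(weights[0])
    total, pairs = 0.0, 0
    for i in range(rows):
        for j in range(cols):
            if i + 1 < rows:                      # vertical neighbor
                total += sq_dist(weights[i][j], weights[i + 1][j])
                pairs += 1
            if j + 1 < cols:                      # horizontal neighbor
                total += sq_dist(weights[i][j], weights[i][j + 1])
                pairs += 1
    if pairs == 0:
        raise ValueError("a single-unit map has no neighborhood relations")
    return total / pairs
```

Dividing by the number of neighbor pairs is what makes the score comparable across grid formats, as required in the text.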

The issue of cluster validity is addressed in two ways here: First, as a measure 
of partition agreement the Hubert and Arabie (1985) adjusted Rand index 
(“A. Rand”) is computed between each pair of 30 replications of SOM training 
for the total sample. The considerably high averages reported in table 2 
suggest that SOMs are largely insensitive to random initialization effects 
(the unadjusted Rand values are close to unity). Second, for each 
of the indicated SOM layers separate replication analyses as proposed by 
Milligan (1996, p. 368) are conducted for 30 split-half random samples. As 
expected, mean adjusted Rand values do not reach those for the total sample. 
They do, however, achieve levels (unadjusted Rand values are all larger than 
0.9) that strongly support the stability of SOM results. 
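
For readers who wish to reproduce this kind of replication analysis, the adjusted Rand index of Hubert and Arabie (1985) can be computed from two label vectors as sketched below in Python. The function name is hypothetical, and returning 1.0 when both partitions are trivial is a convention chosen here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two partitions of the same objects."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must cover the same objects")
    n = len(labels_a)
    if n < 2:
        raise ValueError("at least two objects are required")
    # Pairs of objects placed together in both partitions, in A only, and in B only.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)   # agreement expected by chance
    maximum = (sum_a + sum_b) / 2
    if maximum == expected:                 # e.g. both partitions consist of one cluster
        return 1.0
    return (sum_cells - expected) / (maximum - expected)
```

The index is invariant to the labeling of clusters, so two SOM runs that assign the same households to differently numbered units still reach the maximum value.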



4.3 Comparison of SOM with MULTICLUS Results 

To illustrate peculiarities of SOM results, consider first a two-dimensional 
mapping of brand positions (1-9) and segment vectors estimated by the 
MULTICLUS model. The minimum AIC statistic occurs for a solution with 
four clusters (VAF = 0.351) designated by the letters A-D and posterior 
mixing proportions (relative segment sizes) as shown in figure 1. 

[Joint space plot. Legend: lnL = -16829.99; AIC = 33709.98; VAF = 0.351; 
relative segment sizes A: 18.36 %, B: 13.49 %, C: about 50 %, D: 17.76 %] 

Figure 1: MULTICLUS joint space of brand points and segment vectors 


Probably the most notable feature of the resulting configuration is the 
outstanding position of the leading brand 1 (about 30 per cent market share). 
Since the vector of segment A is positioned closest to brand 1, high 
mean choice probabilities of about 0.75 are registered in this segment. 
Cluster C, on the other hand, may be characterized by strong outlet and product 
usage preferences (high average purchase rates for the private labels 2, 8, 9 
and the cooking/baking margarine brands 3 and 4), while preferences for 
diet brand 7 are above average for segment B members. The interpretation 
of segment D seems to be less obvious. 

In contrast to the geometric MULTICLUS model, SOMs provide only ordinal 
adjacency information about various brand preferences along the discrete 
map of prototypes, which denote competitive relations between brands on a 
market segment level. According to table 2, MSIPD scores “level off” when 
reducing a 3 x 3 format to a 2 x 3 SOM layer. Therefore, we turn to the 
3 x 3 SOM solution as depicted in figure 2 for interpretation. 

Since each of the nine prototypes represents one segment of households (the 
relative sizes are indicated in brackets), it is the combination of weights 
values (magnitudes of the columns) that denotes the competition intensity 
among rival brands in a specific submarket: Segments with distinctive brand 
choice patterns (i.e. different types of CMS) are positioned in different 
directions or corners of the competitive map; e.g., households with a 
relatively strong preference for the general-purpose brand 1 are assembled in 
segment no. 1 (19.3 %), while store-loyal submarkets for private labels 
(brands 2, 8, 9) are positioned in the map’s opposite corner. Segment no. 3 
is characterized by above-average preferences towards the diet brands 6 and 
7. Purchase behavior of segment no. 7 members seems to be dominated by brand 
usage (brands 3 and 4 are traditionally used for cooking and baking). 
Other segments of smaller size represent “mixtures” of their adjacent 
prototypes. Finally, low weights values of a brand throughout the map indicate 
a “fuzzy” preference position that might call for repositioning actions. 

Figure 2: 3 x 3 competitive map with prototype profiles 



5 Concluding Remarks 

The present paper documents that the effect of neighborhood updating 
makes SOM methodology accessible for combined CMS/segmentation analysis. 
Using disaggregate brand choice probabilities, SOMs are shown to be 
capable of simultaneously reducing both the consumer and the brand mode of 
input data in a discrete map of topologically ordered prototypes denoting the 
condensed rivalry between brands on a segment level. As previously shown 
by Mazanec (1995) for three-way data, the non-parametric nature of SOM 
analysis also allows for compression of binary profile data. Furthermore, 
since only local information is required at each iteration, SOM training can 
be performed for arbitrarily large data sets and/or “online” for continuously 
incoming data, such as retail scanner data. Together with a more flexible 
neighborhood updating mechanism that prevents prototypes from deviating 
from their centroid property, SOM related approaches thus seem to be 
worthwhile for further applications in multi-mode marketing data analysis. 

Abstract: The “analysis of a priori defined groups” (Aaker et al. (1995)) is an 
important task of applied market research, e.g., with respect to consumer 
segmentation. Powerful tools for dealing with this kind of problem are provided 
by methods of discriminant analysis. A main objective of this paper is the 
investigation of both past and future importance of this “traditional” approach 
for analyzing a priori defined groups in applied market research. 



1 Introduction 

The analysis of a priori defined groups is marked by a fixed number of 
mutually exclusive groups with fixed size and composition. This is typi- 
cally determined by the underlying investigation itself and not by means of 
methodology. According to McLachlan (1992) the “traditional” method for 
analyzing a priori defined groups is discriminant analysis (tracing back to 
Fisher (1936)). Within the current decade the use of neural networks as an 
alternative tool for such analyses has become of special interest in market- 
ing and management science. Several comparisons of the predictive power 
of conventional discriminant analysis and neural networks presented in the 
appropriate literature seem to give some evidence for a certain superiority 
of the latter (Krycha, Wagner (1997)). Starting from this, five basic ques- 
tions seem to be worth looking at in detail in order to investigate the use of 
discriminant analysis in applied market research: 

A To what extent has discriminant analysis been applied in market re- 
search compared to other usual methods of multivariate data analysis? 

B How are problems of application treated by market researchers? 

C What are the main areas of use in marketing? 

D To what extent are neural networks used to solve discrimination prob- 
lems in marketing? 

E Which new areas of the analysis of a priori defined groups in marketing 
become apparent? 



For both economic and scientific reasons we decided to provide answers to 
these questions by following the usual principle of first evaluating suitable 
secondary data sources and then elaborating the remaining issues through 
additional primary research of our own (Malhotra (1996)). This was especially 
useful in the present context because the market research community represented 
by both secondary and primary sources, namely market researchers 
in practice (e.g., in business and research institutes) and science (e.g., 
publishing in related journals), is more “balanced” than one exclusively drawn 
from primary data sources. 



2 Findings from Secondary Data Analysis 

In the mid seventies Greenberg et al. (1977) investigated which techniques 
and methodologies were being used by business firms engaged 
in marketing research activities. According to the companies interviewed 
in this study the extent of application of discriminant analysis was 
relatively low (rank ten among the twelve methods considered). Except 
for the consumer goods industry (rank eight) this also held for almost all 
the different types of companies. With respect to the main areas of 
use of discriminant analysis in marketing the relevant branches can be 
divided into two groups: The proportions of applications in consumer goods 
industry, consumer and industrial goods industry, utilities as well as 
marketing research and consulting were above the total sample proportion of 14 %, 
whereas all the remaining areas were characterized by lower shares. In a 
survey of German market research institutes, carried out by Gaul et al. (1986) 
in the mid eighties, discriminant analysis ranked eighth out of 24 methods 
which were applied at least “seldomly”. In contrast, looking at the 
proportion of institutes which at least “sometimes” applied any of these methods 
only led to rank 12 for discriminant analysis. Furthermore, in this study 
discriminant analysis was identified as a “standard multivariate method”. 
Nevertheless, a method-based classification of the research institutes finally 
provided the result that the mean intensity of use of discriminant analysis 
was highest in a segment called “data analysis experts”, who probably had 
no problems in applying this method. 

The use and proliferation of quantitative techniques in scientific marketing 
research from 1964 to 1989 (at 5 year intervals) was examined by Waheeduz- 
zaman, Krampf (1992) applying a content analysis to six leading American 
marketing journals. Discriminant analysis was only used within 20 out of 549 
articles reporting the application of quantitative methods which corresponds 
to rank 8 out of 13 methods under consideration. The main application area 
was consumer behavior, whereas, e.g., no application was noted in pricing, 
distribution, and retailing, respectively. Finally, we had a look at the official 
bibliography of the Marketing Science Institute by Dickinson (1990) which 
contains 106 applications of discriminant analysis published from 1974 to 
1988 and which led us to the following main results: First, the main subject 
areas were consumer behavior and segmentation, advertising and communi- 
cation as well as strategy and competition research (altogether about 76 %). 
Second, most of the articles (nearly 68 %) were published in journals related 
to the labels “research” or “science”. These findings can be interpreted as 
a certain confirmation of the above mentioned major use of discriminant 
analysis in “consumer behavior research” and by “skilled analysts”. 
Unfortunately, by definition, secondary research always means looking 
back (Aaker et al. (1995)). Thus, although we found some interesting 
tendencies in the past, it seemed to be impossible to answer all the questions 
of interest in this way. 



3 Findings from an Internet Research 

In February 1998 we counted the matches of several internet search engines 
concerning selected methodological (e.g., “cluster analysis” or “discriminant 
analysis”) and marketing-oriented keywords (e.g., “marketing” or “market 
research”). We found that discriminant analysis was the eighth most popular 
method when only methodological keywords were used for the search. In 
connection with the marketing keywords “market research” and “consumer 
behavior” discriminant analysis ranked seventh and third, respectively (out 
of 15 methods of interest). This lends some support to the findings of our 
secondary research, especially with respect to applications 
in consumer behavior analysis. Unfortunately, the internet research did not 
reveal any new areas of the analysis of a priori defined groups in marketing. 
It should also be pointed out that the keyword “neural network” produced 
comparatively high numbers of matches within the internet research. 
Nevertheless, one could argue that this kind of evaluation is a rather exaggerated 
“snapshot” since it cannot be generally supposed that every link actually 
represents a real application of the respective technique. Furthermore, it 
is rather difficult to make any final statements about the reliability of an 
internet research on the basis of a “one shot study”. Thus, we decided to 
conduct an additional review of publications that appeared within the last ten 
years in selected journals. 



4 Findings from an Actual Literature Review 

To carry out a review of the relevant literature we first defined 37 criteria 
which were mainly nominally scaled and which had to be recorded from 
selected English and German language publications in marketing and market 
research journals (along with some other journals related to quantitative 
methods in marketing). For analytical purposes these criteria were labeled 
and categorized as follows: 

1. “Subject of Investigation” 

(e.g., “Main Subject Area” or “Type of Observations”), 
2. “Research Design” 

(e.g., “Sampling Method” or “Data Capture”), 

3. “Type of Discriminant Analysis” 

(e.g., “Mathematical Model” or “Number of Groups”), and 

4. “Interpretation & Software” 

(e.g., “Chance Criteria” or “Cross-validation”). 

Our parent population was defined by all the publications concerning 
appropriate applications of conventional discriminant analysis in marketing and 
market research from 1988 to 1997. Focussing on this period of time we were 
able to “continue” both the study of Waheeduzzaman, Krampf (1992) and 
our own evaluation of the MSI bibliography. From this set of articles a 
sample was drawn following a two-stage procedure: First, a convenience sample 
of journals of interest was selected. In order not to omit any important 
publications our sample also included journals which were not exclusively 
related to marketing or market research. In a second step only marketing 
and market research oriented publications were screened from these journals 
by means of a census. In this way about 6,700 articles published in five German 
language and nine English language journals yielded a set of merely 61 
articles which seemed to be relevant for a closer examination. 

Starting from the original set of criteria we could finally take into account 
just 24, because only for these criteria could useful values be generated during 
data collection. Nevertheless, missing values also occurred for some of the 
remaining criteria. Possible reasons for the occurrence of missing values 
might be, for example, confidentiality reasons on the part of the authors, missing 
computation (e.g., because of methodological problems) or simply errors in 
our data collection (e.g., during the editing procedure). The missing value 
problem was treated as follows: 

• With respect to all criteria we just used the proportions of missing 
values as a first indicator for possibly existing methodological problems. 

• Further conclusions were only drawn from those criteria which were 
characterized by an uncritical proportion of missing values. 

It should be noted that numerous details concerning the categories “Research 
Design”, “Type of Discriminant Analysis”, and “Interpretation & Software” 
were predominantly incomplete. One possible reason for this might be a 
more or less incomplete methodological background. In order to perform 
statistical tests for equal proportions as well as multiple correspondence 
analyses all those criteria without any missing values were put together in 
different groups, each containing exactly three criteria. 
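
A test for equal proportions of the kind mentioned above can be sketched in Python as a chi-square goodness-of-fit test against a uniform distribution over the categories of one criterion. The sketch is an assumption about the form of the test, the function names are hypothetical, and the category counts are invented for illustration (they merely add up to the 61 articles of the sample). The critical values are the standard 0.999 quantiles of the chi-square distribution.

```python
# Upper critical values of the chi-square distribution at a significance level of 0.001.
CRITICAL_VALUE_0_001 = {1: 10.828, 2: 13.816, 3: 16.266, 4: 18.467,
                        5: 20.515, 6: 22.458, 7: 24.322}


def equal_proportions_chi2(counts):
    """Return the chi-square statistic and degrees of freedom for H0: equal proportions."""
    n, k = sum(counts), len(counts)
    if k < 2 or n == 0:
        raise ValueError("need at least two categories and one observation")
    expected = n / k                      # same expected frequency in every category
    statistic = sum((c - expected) ** 2 / expected for c in counts)
    return statistic, k - 1


def rejects_equal_proportions(counts):
    """True if equal proportions are rejected at the 0.001 level."""
    statistic, df = equal_proportions_chi2(counts)
    return statistic > CRITICAL_VALUE_0_001[df]


# Hypothetical distribution of 61 articles over eight subject areas.
subject_area_counts = [5, 14, 9, 4, 10, 2, 15, 2]
statistic, df = equal_proportions_chi2(subject_area_counts)
```

With eight categories the test has seven degrees of freedom, so a statistic above 24.322 leads to rejection at the 0.001 level.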

The analysis of the first group of criteria (“Focus of Article”, “Journal of 
Publication”, and “Year of Publication”) suggested that 
most of the papers related to marketing topics can be classified as 
user-oriented. Obviously, those authors addressing themselves to the marketing 
community did not see any need for methodological research. 

[Correspondence map. Legend and tests for equal proportions (α = 0.001): 

Main Subject Area (χ² = 28.574; DF = 7): Advertising, Consumer Behavior, 
International Marketing, Product Policy, Segmentation, Social Marketing, 
Strategy & Competition, Others 

Type of Observations (χ² = 46.459; DF = 4): Companies, Persons, Products, 
Products & Persons, Strategies or Projects 

Type of Predictors (χ² = 39.066; DF = 7): Company Features, Costs & Benefits, 
Values, Psycho., Socio., Socio. & Psycho., Socio. & Values, Combinations 
with Psycho.] 

Figure 1: Tests for equal proportions and multiple correspondence analysis 


Results for the second group of criteria (“Main Subject Area”, “Type of 
Observations”, and “Type of Predictors”) provided by tests for equal 
proportions and correspondence analysis are graphically represented in Figure 1. 
From the depicted map we can see that, from a marketing theoretical as well 
as from a marketing managerial point of view, both observed and predictive 
variables were selected in a behavioristically and/or strategically useful 
sense, depending on the main subject areas of discriminant analysis. 
Causally expected main subject areas and corresponding predictive variables 
occur in the respective “right” cloud. The lower cloud, e.g., can be 
interpreted in such a way that differences between companies or strategies are 
explained by certain company features. Furthermore, the main subject 
areas in the underlying set of articles were strategy and competition research 
as well as international marketing. On the one hand this can be seen as a 
confirmation of obvious presumptions. On the other hand this also seems 
to underline the findings of the analyses of the first group of criteria, in the 
sense that from a marketing user’s point of view discriminant analysis has 
to be considered as a more or less fully developed method. Perhaps this is 
why the authors apparently had no problems in correctly defining objects 
and variables with respect to the corresponding main subject area. 

These findings were also confirmed by the analyses of a third group of 
criteria (“Goal of Data Analysis”, “Part of Discriminant Analysis” and “Further 
Methods”). Using multiple correspondence analysis we were able to discern 
several clouds, each of them representing a different goal of data analysis. To 
attain these goals combinations of methods were used. In other words, 
different methods from multivariate data analysis were combined in a useful 
manner according to the given problem. Nevertheless, discriminant analysis 
was often only used as a supporting approach to complete a given bundle of 
methods. Furthermore, we did not find any application of neural networks 
with respect to the present domain in our set of articles. 



5 Conclusions 

The following general results can be summarized with respect to the ques- 
tions asked in the beginning: 

A Discriminant analysis is widely considered as a standard method of 
multivariate data analysis. At the same time the evaluation of dif- 
ferent sources related to marketing has shown that it seems to be 
comparatively seldom used in applied market research up to now. Ad- 
ditionally, we can state that it was rarely the main method of data 
analysis in our primary study. 

B The evaluation of several secondary sources (especially the study of 
Gaul et al. (1986)) has shown that discriminant analysis is mainly 
used by data analysis experts. This was also supported by some 
correspondence analyses of our own. 

C Apart from consumer behavior analysis and segmentation, discriminant 
analysis was mainly used in strategy and competition research as 
well as in product policy. 

D In contrast to the results of our internet research the evaluation of 
appropriate journals suggests that the use of neural networks in the 
given context seems to be still in its infancy. 

E None of the studies referred to in this paper has provided a basis that is 
both usable and comprehensive for the deduction of new areas in marketing 
which might profit much from the analysis of a priori defined 
groups. The necessity of future research concerning this point is more 
than obvious. 

Abstract: Since the methods of measuring consumer response to changes in 
marketing mix have been improved successively in recent years, the problem of 
analyzing and forecasting competitive reactions has become one of the most 
challenging topics for model-based scanner data analysis. In this paper a 
stochastic model for the evaluation of competitive reactions is proposed. The 
adequacy of this approach with respect to the analysis of data reflecting market 
competition is demonstrated by way of example using point-of-sale scanner data. 



1 Introduction 

1.1 Modeling Competition 

Competition influences almost every aspect of marketing management and is 
therefore reflected in different facets of marketing modeling. For instance, 
the formulation of discrete econometric choice models without the so-called 
“Independence of Irrelevant Alternatives” property proposed by Luce (1959) 
has been elaborated to permit the calculation of cross elasticities which are 
suited to capture the competitive influence on consumers’ choice (Hruschka 
(1996)). Competitive influences can be integrated into stochastic models of 
consumer behavior as shown by Decker et al. (1997) for zero order consumer 
decision processes and by Carpenter, Lehmann (1985) for higher order 
decision processes. In order to process the corresponding results for management 
decision support, e.g. by performing simulations or “what-if” analyses, 
assumptions concerning the competitive reactions are indispensable (Cooper, 
Nakanishi (1988); Leeflang, Wittink (1992)). 

An analogous problem of competitive reactions arises in normative marketing 
modeling. Specific assumptions concerning the competitive reactions 
are needed to optimize the use of marketing instruments, for example no 
reaction, a mutual best response in the spirit of a Nash equilibrium solution, 
or some other suitable assumptions. A common characteristic of the resulting 
models is the rationality assumption with respect to at least one 
competitor. However, the assumptions of market clearing and permanent 
rational behavior seem to be unsuitable to advance description and analysis 
of real market processes (Engelhardt (1995)). Rubinstein (1995) points out 
the implicit assumptions of the “rational-man” paradigm, for instance clear 
problem knowledge, unambiguity of preferences referring to consequences of 
a decision as well as unlimited capacity of formal optimization. Furthermore, 
he clarifies the need for alternative models. Management decisions in 
competitive environments in particular are subject to bounded rationality 
because managers usually simplify their decision making by using only 
selected information (Deshpande, Gatignon (1994)). Moreover, evaluations of 
competitive reactions using econometric response functions indicate 
the possibility of over- and under-reactions in the interaction of competitors 
(Leeflang, Wittink (1996); Brodie et al. (1996)). 

The present paper aims at an explanation of management decisions following 
the behavioristic stimulus-response paradigm and using a stochastic model 
of competitive behavior. Management decision making with respect to suitable 
reactions to extraordinary competitive activities is modeled analogously 
to consumer purchase decision making. The model is based on the so-called 
tit-for-tat strategy, which has been found to be very successful in the 
simulation studies of Axelrod (1984). This strategy appears to be an intuitively 
comprehensible tool for reducing the complexity of the competitive environment 
and also allows the assumption of formal optimization by individual 
competitors to be dropped. Furthermore, the following stochastic model 
offers the possibility of integrating aggressive marketing measures. 
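
The tit-for-tat strategy itself is simple enough to be stated in a few lines of Python: cooperate in the first period and thereafter repeat the competitor’s move of the preceding period. The sketch below only illustrates the strategy as known from Axelrod (1984); the move labels and function names are hypothetical and are not part of the stochastic model developed in this paper.

```python
COOPERATE, DEFECT = "C", "D"


def tit_for_tat(competitor_history):
    """Cooperate first, then mirror the competitor's most recent move."""
    if not competitor_history:
        return COOPERATE
    return competitor_history[-1]


def respond_to(competitor_moves):
    """Sequence of own moves when facing the given sequence of competitor moves."""
    own_moves = []
    for period in range(len(competitor_moves)):
        # Only moves from earlier periods are known when deciding.
        own_moves.append(tit_for_tat(competitor_moves[:period]))
    return own_moves


# A competitor who defects in the second and fifth period.
reactions = respond_to(["C", "D", "C", "C", "D"])
```

Each defection is answered exactly once in the following period, after which cooperation resumes.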



1.2 Data Description 

Leeflang and Wittink (1992) point out that retail scanner data may be a 
useful basis for analyzing competitive reactions. In the following, a data 
set describing a selected market segment of a frequently bought product of 
everyday use, which has been kindly provided by the IRI/GfK Retail Service 
GmbH (Nürnberg), will be employed. Since different varieties of products are 
characterized by identical prices as well as by an identical use of marketing 
instruments, the following analysis can concentrate on the manufacturer 
level. Altogether, 38 manufacturers are competing with each other 
in the considered market, but more than 70 % of the sales have been realized 
by three leading manufacturers. The data was collected from 119 German 
outlets of at least 800 m² over a period of 46 weeks. Most of these outlets 
belong to one of six leading retailing chains. Both the assortment and the 
success of individual manufacturers differ between the outlets of different 
chains as well as between outlets belonging to the same chain. Moreover, 
heterogeneity with respect to the individual usage of marketing instruments seems 
to be sufficiently reflected in the data. The use of instore promotions is 
shown by way of example in figure 1. Obviously, there are fluctuations in the 
usage of instruments over time. However, more important for modeling 
competitive reactions is the point that the number of outlets of individual chains 
which put the instruments into practice fluctuates. This is indicated by the 
varying lengths of the corresponding bars in the chart. As one can see, the 
conjecture that all the outlets belonging to one chain would simultaneously 
promote the products of a particular manufacturer has to be rejected. 

Figure 1: Use of additional instore displays by manufacturer 2 



A basic advantage of the stochastic model to be described in the next section 
is the explicit possibility of capturing heterogeneity via special distributions. 
For demonstration purposes three marketing instruments (price discounts, 
additional instore displays and communicative measures) will be taken into 
account. 

2 A Stochastic Model of Competitive Behavior 

2.1 Assumptions 

At the retail level competitive reactions can be operationalized via marketing 
mix usage (Leeflang, Wittink (1992, 1996)). The proposed stochastic model 
is mainly based on the following assumptions: 

Assumption 1: Competitive reactions, i.e. the uses of marketing instruments, 
can be represented by a multivariate zero order process, which implies 
a substantial demarcation from normative approaches concerning the dynamic 
optimization of the use of marketing mix instruments. It should be 
emphasized that, regarding price as one dimension of the multivariate 
process of instrument uses, only the nonmetrically scaled price promotions are 
considered, but not the price itself. For the explanation of the use of one 
single instrument, the assumption of a higher order process has been used in 
previous studies. In contrast to this, Muhlern (1997) points out that price 
promotions occurring randomly over time have become pervasive. In order 
to evaluate the actual processes of uses of instruments in the current data 
representing a market segment of a product of everyday use, an asymptotic 
multinomial runs test has been calculated. The assumption of a zero order 
process seems to be suitable since it could not be rejected for more than 79 % 
of the processes investigated at the outlet level (α = 0.01). Furthermore, this 
assumption allows the integration of a tit-for-tat strategy, implying cooper- 
ative competitive behavior as long as all competitors behave cooperatively. 
In case of non-cooperative behavior of any supplier (so-called ’’defection”) 
the management of the competitors would respond with a single defection 
by means of forced marketing measures. The competitors of interest will 
cooperate in the following periods until the next defection occurs (Axelrod 
(1984)). The assumption of a process in which the present states (i.e. uses of 
instruments in the outlets) are not a function of former states for each sup- 
plier permits the integration of tit-for-tat strategies for individual suppliers, 
since their future behavior depends on competitors’ present behavior but not 
on their own present behavior. Moreover, especially so-called ’’aggressive” 
marketing measures can be captured by a zero order process model since 
they are neither a reaction to former own nor former competitive behavior. 
Assumption 2: The use of a marketing instrument can be represented by 
a counting process. This assumption is due to the fact that retail scanner 
data may provide countings of individual uses of instrument over weeks and 
outlets. Because of the usual binary coding of this type of information the 
trajectories do increase with stepwide one. This leads to the assumption of 
a poisson process for individual uses of marketing instruments (Fahrmeir et 
al. (1981)) which additionally has to be generalized to the multivariate case 
with intensity rates Ximt (i = 1, . . . , / = number of marketing instruments, 
m = 1, . . . , M = number of manufacturers, and t = 1, . . . , T = number 
of periods (weeks)) when more than two marketing instruments have to be 
taken into account. 

Assumption 3: According to the given tit-for-tat hypothesis the intensity
rates of the current use of marketing instruments by individual manufacturers
are assumed to be a function of the competitors’ former instrument uses.
A peculiarity of competition analysis at the retail level is the fact that
reactions to aggressive competitive marketing measures are normally possible
only with a certain time lag, for example because the corresponding instore
promotions have to be planned and executed together with the retail
management. Leeflang and Wittink (1996), for example, assume in a comparable
context that the use of marketing instruments in week $t$ in response to
competitive behavior in weeks $t-2$, $t-3$, and $t-4$ may be interpreted as a
kind of retailers’ rivalry. Reactions with a time lag of more than four
weeks, on the other hand, are assumed to be induced by manufacturers. Because
of the special structure of usage in the outlets of individual chains (cf.
figure 1) we will abstract from such a distinction in the following. As an
example, we focus on periods (weeks) $t-2$ and $t-3$, which were found to be
most suitable for the present data. Therefore, the intensity rates mentioned
above can be
formalized in the following way:

$$\lambda_{imt} = \theta_{imt} \cdot c_{im(t-2,t-3)} \qquad (1)$$

where $\theta_{imt}$ denotes some kind of basic intensity of use of marketing
instrument $i$ by manufacturer $m$ in period $t$ and $c_{im(t-2,t-3)}$ denotes
the competitive influence of periods $t-2$ and $t-3$ on the actual use of
instrument $i$ by manufacturer $m$.

Assumption 4: Heterogeneity with respect to basic intensity $\lambda_{imt}$
can be represented by a multivariate gamma distribution. This finally leads
to a Dirichlet distribution of the probabilities of the use of marketing
instruments in individual outlets (see also Wagner, Taudes (1986)).



2.2 Polya Model of Competitive Reactions 

Starting from analogous assumptions on the consumer level Wagner, Taudes 
(1986) have derived their well-known Polya model of brand choice and pur- 
chase incidence. In the following we will adapt this approach to the analysis 
of competitive reactions on the retail level. In the present context the model 
is used to represent the probability of the vector of the number of uses of I 
different marketing instruments by manufacturer m in period t. 



$$P(N_{1mt} = n_{1mt}, \ldots, N_{Imt} = n_{Imt}) = \prod_{i=1}^{I} \binom{n_{imt} + \delta_{im} - 1}{n_{imt}} \left( \frac{\mu_m}{\mu_m + c_{im(t-2,t-3)}} \right)^{\delta_{im}} \left( \frac{c_{im(t-2,t-3)}}{\mu_m + c_{im(t-2,t-3)}} \right)^{n_{imt}}, \quad t > 3 \qquad (2)$$

where:

$\delta_{im}$ = shape parameter of the multivariate Gamma distribution for
the use of marketing instrument $i$ by manufacturer $m$
$\mu_m$ = scaling parameter of the multivariate Gamma distribution with
respect to manufacturer $m$
$N_{imt}$ = random variable to represent the number of uses of marketing
instrument $i$ by manufacturer $m$ in period $t$
$n_{imt}$ = observed number of uses of marketing instrument $i$ by
manufacturer $m$ in period $t$
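
As a minimal sketch of how equation (2) can be evaluated numerically, the
following Python fragment computes the probability of a vector of instrument
uses, assuming the product form over the instruments with one negative
binomial factor per instrument. The function names and the example parameter
values are hypothetical and serve illustration only; they are not estimates
from the scanner data.

```python
import math


def polya_log_prob(counts, delta, mu, c):
    """Log of the product-form probability sketched in equation (2).

    counts[i] : observed uses n_imt of instrument i
    delta[i]  : shape parameter delta_im (> 0)
    mu        : scaling parameter mu_m (> 0)
    c[i]      : competitive influence c_im(t-2,t-3) (> 0)
    """
    total = 0.0
    for n_i, d_i, c_i in zip(counts, delta, c):
        # ln of Gamma(n + delta) / (Gamma(n + 1) * Gamma(delta))
        total += math.lgamma(n_i + d_i) - math.lgamma(n_i + 1) - math.lgamma(d_i)
        total += d_i * math.log(mu / (mu + c_i))
        total += n_i * math.log(c_i / (mu + c_i))
    return total


def polya_prob(counts, delta, mu, c):
    """Probability of the count vector (exponential of the log-probability)."""
    return math.exp(polya_log_prob(counts, delta, mu, c))


# Hypothetical example: two instruments, illustrative parameter values.
example = polya_prob([1, 2], [1.5, 0.7], 2.0, [1.0, 0.5])
```

Working on the log scale with `math.lgamma` avoids overflow of the Gamma
function for larger counts; for a single instrument the probabilities sum to
one over all possible counts.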

As one can see, the competitive influences are captured by the explanatory
variable $c_{im(t-2,t-3)}$. Following an approach proposed by Decker et al.
(1997), a multiplicative specification of the corresponding response function
has been chosen:

$$c_{im(t-2,t-3)} = \exp(\alpha_{im}) \prod_{\substack{\tilde{m}=1 \\ \tilde{m} \neq m}}^{M} \prod_{j=1}^{I} \left( n_{j\tilde{m},t-2} + n_{j\tilde{m},t-3} \right)^{\beta_{j\tilde{m}}}, \quad t > 3 \qquad (3)$$

where:

$\alpha_{im}$ = intercept for the use of marketing instrument $i$ by reacting
manufacturer $m$
$\beta_{j\tilde{m}}$ = parameter to represent the influence of the use of
marketing instrument $j$ by acting competitor $\tilde{m}$
($\tilde{m} = 1, \ldots, m-1, m+1, \ldots, M$)
$n_{j\tilde{m},t-\tau}$ = number of uses of marketing instrument $j$ by acting
competitor $\tilde{m}$ in period $t - \tau$
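
The multiplicative response function of equation (3) can be sketched as
follows. This is an illustrative Python fragment under the assumption that
the lagged counts of periods t-2 and t-3 are summed before being raised to
the power of the corresponding parameter; the function name, the list-based
data layout and the example values are hypothetical.

```python
import math


def competitive_influence(alpha_im, beta, counts_t2, counts_t3, reacting):
    """Competitive influence c_im(t-2,t-3) in the form of equation (3).

    alpha_im      : intercept for instrument i and reacting manufacturer m
    beta[k][j]    : influence parameter for instrument j of competitor k
    counts_t2/_t3 : counts_t2[k][j] = uses of instrument j by competitor k
                    in period t-2 (respectively t-3)
    reacting      : index of the reacting manufacturer m (skipped in the product)
    """
    value = math.exp(alpha_im)
    for k, betas in enumerate(beta):
        if k == reacting:
            continue  # the product runs over the acting competitors only
        for j, b in enumerate(betas):
            base = counts_t2[k][j] + counts_t3[k][j]
            value *= base ** b
    return value


# Hypothetical example with two manufacturers and two instruments.
c_value = competitive_influence(
    0.0,
    beta=[[9.0, 9.0], [1.0, 2.0]],
    counts_t2=[[5, 5], [1, 2]],
    counts_t3=[[5, 5], [1, 1]],
    reacting=0,
)
```

Note that in this form a competitor count of zero combined with a positive
exponent drives the whole product to zero, so an application would need to
decide how such periods are to be treated.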



2.3 Calibration 

The estimation of the models’ parameters may be carried out using the 
following log-likelihood function: 



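$$\ln L(\cdot) = \sum_{v=1}^{V} a_{vm} \sum_{i=1}^{I} \left[ \underbrace{\ln \left( \frac{\Gamma(n_{imt} + \delta_{im})}{\Gamma(n_{imt} + 1)\, \Gamma(\delta_{im})} \right)}_{A_{imt}} + \delta_{im} \ln \left( \frac{\mu_m}{\mu_m + c_{im(t-2,t-3)}} \right) + n_{imt} \ln \left( \frac{c_{im(t-2,t-3)}}{\mu_m + c_{im(t-2,t-3)}} \right) \right] \quad \forall\, m \qquad (4)$$

where $a_{vm}$ represents the number of occurrences of the $v$-th combination.
In order to simplify numerical calculations $A_{imt}$ can be reformulated as
follows:

$$A_{imt} = \sum_{l=0}^{n_{imt}-1} \ln \left( \frac{n_{imt} + \delta_{im} - 1 - l}{l + 1} \right) \quad \forall\, i, m, t \qquad (5)$$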
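
The equivalence between the Gamma function form of the term A_imt and its
reformulation as a finite sum in equation (5) can be checked numerically. The
following Python sketch compares both versions; the function names are
hypothetical and the parameter values are arbitrary illustrations.

```python
import math


def a_term_gamma(n, delta):
    """A_imt via log-Gamma functions: ln[Gamma(n+delta) / (Gamma(n+1) Gamma(delta))]."""
    return math.lgamma(n + delta) - math.lgamma(n + 1) - math.lgamma(delta)


def a_term_sum(n, delta):
    """A_imt via the finite sum of equation (5); the sum is empty for n = 0."""
    return sum(math.log((n + delta - 1 - l) / (l + 1)) for l in range(n))


# Largest discrepancy over a small grid of counts and shape parameters.
max_gap = max(
    abs(a_term_gamma(n, d) - a_term_sum(n, d))
    for n in range(0, 11)
    for d in (0.3, 1.0, 2.5)
)
```

The sum form needs only logarithms of ratios, which is why it is convenient
when no log-Gamma routine is available.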



2.4 Selected Empirical Results 

To get an impression of the attained model fit, the use of marketing
instrument 2 (additional instore displays) is depicted in figure 2 with
respect to the three leading manufacturers. Following Dekimpe, Hanssens
(1997), a distinction between “business as usual” and (re-)active use of
marketing instruments can be made for interpretation purposes. For periods
t = 10 to t = 20 an initial increase in the use of instrument 2 by
manufacturer 2 can be made out. Since this is apparently not a reaction to an
extraordinary use of marketing instruments by the other manufacturers, it
can be interpreted as a kind of aggressive marketing measure. The reactive
uses of instrument 2 by manufacturer 1 and manufacturer 3 in the following
weeks can be made out clearly and are fitted quite well by the model. An
analogous result is obtained for periods t = 30 to t = 40, where
manufacturer 2, but not manufacturer 1, responds to an aggressive measure
of manufacturer 3 in week 29. Subsequently, manufacturer 1 reacts to the
exceptional use of instruments by manufacturer 2. This seems to point to
asymmetric relations in competitive behavior. The apparent underestimation
of relevant reactive measures in the later periods might call for the
consideration of alternative reaction hypotheses by changing or enriching
the specification of equation 3.

In addition to the adjusted likelihood ratio as a possible measure to
evaluate the overall model fit, Theil’s inequality coefficient seems to be an
appropriate criterion to assess model-based data reproduction as well (cf.
Pindyck, Rubinfeld (1991)). Both criteria satisfactorily confirm the first
impression of an adequate face validity of the Polya model with respect to
the analysis of competitive reactions. With regard to the visual inspection
of the model’s fit, it has to be emphasized that only one dimension of the
multivariate process is reflected in figure 2.
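
For illustration, one common form of Theil’s inequality coefficient (the
root mean squared error divided by the sum of the root mean squares of the
actual and the modeled series, which lies between 0 and 1) can be sketched in
Python as follows. Whether exactly this variant was used for the reported
values is an assumption; the function name and the example series are
hypothetical.

```python
import math


def theil_u(actual, modeled):
    """Theil's inequality coefficient in its bounded form (0 = perfect fit)."""
    if len(actual) != len(modeled) or len(actual) == 0:
        raise ValueError("series must be non-empty and of equal length")
    t = len(actual)
    rms_error = math.sqrt(sum((s - a) ** 2 for a, s in zip(actual, modeled)) / t)
    rms_actual = math.sqrt(sum(a * a for a in actual) / t)
    rms_modeled = math.sqrt(sum(s * s for s in modeled) / t)
    denominator = rms_actual + rms_modeled
    # Both series identically zero: treat as a perfect fit.
    return rms_error / denominator if denominator > 0 else 0.0


# Hypothetical weekly counts of instrument uses and model values.
u_example = theil_u([4, 0, 2, 5], [3, 1, 2, 4])
```

Values close to 0 indicate that the modeled series reproduces the observed
one well, values close to 1 indicate a poor reproduction.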



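[Figure 2 plot: three panels (Manufacturer 1, Manufacturer 2, Manufacturer 3)
showing the number of uses of instrument 2 over the periods t; lines for real
behavior and modeled behavior. Fit measures reported with the figure:]

m        1       2       3
         0.55    0.86    0.82
U_m      0.27    0.11    0.14

Figure 2: Model fit with respect to the use of additional instore displays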

3 Conclusions and Outlook

We have evaluated the suitability of a stochastic model which is well-known 
from individual consumer behavior analysis as a new tool to analyze the 
competitive use of marketing instruments from a retailer’s point of view using 
real point-of-sale scanner data. Assuming a zero order process we were able 
to integrate a tit-for-tat hypothesis concerning individual behavior. Further 
research will concentrate, e.g. on the development of an extended approach 
to additionally take into account asymmetric competition. Finally, further 
applications of the given approach to different consumer goods markets seem 
to be necessary to consolidate these first encouraging results. 

1 Statistics for Archaeology

Statistical methods now form an important part of the interpretative tool 
kit of archaeologists. Of these the most common are descriptive statistical 
methods such as: means and standard deviations, medians and modes, his- 
tograms, pie charts, line graphs, etc. It is increasingly common, however, 
for archaeologists and their co-workers (such as physicists, chemists, geol- 
ogists and environmental scientists) to adopt model-based statistical tools. 
Such tools use mathematical representations of the processes which gave 
rise to the data we observe today and help us to begin to understand them. 
Statistical models are particularly relevant for aiding in archaeological in- 
terpretation as they allow us to include sources of uncertainty that are so 
often present in our understanding of the archaeological record. 

Most statisticians use models; here we focus on how Bayesian statisticians
use models. Bayesian statisticians believe that a priori knowledge about the 
properties (parameters) of the model is important and valuable and should 
be considered along with the model and the data. Such prior information 
is usually only available from experts in the field in question. Indeed, most 
archaeologists consider themselves to be experts in particular areas of study 
and see prior information from those areas as forming an essential part of any 
archaeological data interpretation. Traditionally, however, finding explicit 
ways of allowing for prior knowledge has been difficult and it is here that 
the Bayesian paradigm is proving useful. 



2 The Bayesian Approach 

The Bayesian approach has at its core the work of Rev. Thomas Bayes (c.
1702 to 1761). His most famous work was published posthumously in 1763
(Bayes, 1763). Only since about 1960, however, has the Bayesian statistical 
framework been seen as a feasible tool for interpretation and only since the 
1980s have suitable computational tools been available to allow its applica- 
tion to real archaeological problems. For a discussion of the philosophical 
framework and a recent perspective on the advantages and disadvantages of 
the Bayesian approach see Howson and Urbach (1993) and for consideration 
of the issues from an archaeological perspective see Buck et al. (1996). 

Bayes’ theorem is intuitively very simple. Consider data we have collected,
$y$, and parameters of a model, $\theta$. The parameters are from a model
about which we wish to learn using the information that these data contain,
but also allowing for any relevant prior information. In its basic form,
Bayes’ theorem then has three components which need exploration: the
likelihood, the prior, and the posterior.

The likelihood is a statistical function whose form is determined by the
model we are using but which in general terms can be denoted as
$P(y \mid \theta)$. In simple terms, we can think of this as a statement of
how likely the observed data values are given some specific values of the
unknown parameters.

The prior is also a function and can be denoted by $P(\theta)$. In simple
terms, we can think of this as how much belief we attach to these specified
values of the unknown parameters before (a priori) we observe the data.

The posterior is what we want to obtain (a combination of the information
contained in the data, the model and the prior) and can be denoted by
$P(\theta \mid y)$. In simple terms, we can think of this as the amount of
belief we attach to these specified values of the unknown parameters after
observing the data.

Bayes’ theorem relates these three components thus

$$\text{posterior} \propto \text{likelihood} \times \text{prior}$$

or

$$P(\theta \mid y) \propto P(y \mid \theta) \times P(\theta).$$

Thus, Bayes’ theorem provides a mechanism for obtaining a posteriori in- 
formation about the parameter values of interest based upon the data, the 
model and the prior. 
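
To make the proportionality concrete, the following Python sketch evaluates
the theorem on a grid of parameter values and normalises the result. The
setting (an unknown proportion, a uniform prior and 7 occurrences in 10
observations) is a hypothetical illustration and does not come from any of
the studies cited here.

```python
def grid_posterior(prior, likelihood):
    """Posterior on a grid: multiply prior by likelihood, then normalise."""
    unnormalised = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


# Grid of candidate values for the unknown proportion theta.
thetas = [k / 100 for k in range(101)]

# Uniform prior: no value of theta is preferred a priori.
prior = [1.0 / len(thetas)] * len(thetas)

# Hypothetical data: 7 occurrences in 10 observations (binomial likelihood;
# the constant binomial coefficient cancels in the normalisation).
occurrences, observations = 7, 10
likelihood = [t ** occurrences * (1 - t) ** (observations - occurrences) for t in thetas]

posterior = grid_posterior(prior, likelihood)
posterior_mode = thetas[posterior.index(max(posterior))]
```

With a uniform prior the posterior is simply the normalised likelihood, so
its mode coincides with the observed proportion.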



3 Why This is Useful for Archaeologists 

Archaeologists routinely base their interpretations on a wide range of sources 
of information. The Bayesian paradigm provides us with a coherent and ex- 
plicit framework for combining such information to arrive at interpretations 
which reflect as much of current knowledge as possible.

Once we have developed a model and used it to investigate one data set, the 
Bayesian framework provides us with a mechanism to learn by experience. 
The posterior we obtain today becomes the prior we adopt the next time we 
obtain more data of the same type. This way we build on what we learn at 
each stage in our research rather than treating each investigation as a ‘one 
off’. 
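
This learning by experience can be sketched numerically. The Python fragment
below, a hypothetical illustration with invented data, shows that using the
posterior from a first data set as the prior for a second data set gives the
same result as analysing both data sets at once, provided the two data sets
are independent given the parameter.

```python
def update(prior, likelihood):
    """One application of Bayes' theorem on a grid (multiply and normalise)."""
    unnormalised = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


thetas = [k / 100 for k in range(1, 100)]
prior = [1.0 / len(thetas)] * len(thetas)


def binomial_likelihood(occurrences, observations):
    """Binomial likelihood on the grid, up to a constant factor."""
    return [t ** occurrences * (1 - t) ** (observations - occurrences) for t in thetas]


first_data = binomial_likelihood(3, 5)   # hypothetical first investigation
second_data = binomial_likelihood(4, 5)  # hypothetical second investigation

# Today's posterior becomes tomorrow's prior.
two_stage = update(update(prior, first_data), second_data)

# Analysing both data sets in one step.
combined = [a * b for a, b in zip(first_data, second_data)]
one_shot = update(prior, combined)

max_gap = max(abs(a - b) for a, b in zip(two_stage, one_shot))
```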

4 An Illustrative Example: Radiocarbon
Dating

Consider just one archaeological event with calendar date $\theta$ BP.
Associated with $\theta$ is a unique radiocarbon age which represents the
amount of $^{14}$C in the sample. The unique radiocarbon age is represented
by $\mu(\theta)$. Unfortunately, due to the nature of archaeological sampling
and to the measurement techniques used, $\mu(\theta)$ cannot be obtained
accurately. In fact, we obtain an observation, $x$, which is an estimate of
$\mu(\theta)$; $x$ is modelled as a realisation of a random variable $X$ where

$$X = \mu(\theta) + \text{noise}.$$

Completing the notation, a typical radiocarbon determination is represented
by $x \pm \sigma$ (for example 3600 ± 60). The exact form of $\mu(\theta)$ is
not known, but there is an internationally agreed calibration curve which is
accepted as a good estimate (for details see Bowman, 1990). The noise (or
error) is estimated by the radiocarbon laboratory and is assumed normal with
mean zero and standard deviation $\sigma$. The error, $\sigma$, is an
essential part of the radiocarbon determination and is quoted by the
radiocarbon laboratory along with the estimated radiocarbon age (for details
see Bowman, 1990). So that

$$X \sim N(\mu(\theta), \sigma^2).$$

Consequently, we have a model that relates radiocarbon determinations,
$x \pm \sigma$, to the date at which the sample stopped metabolising,
$\theta$ BP. For this problem, Bayes’ theorem can be expressed as

$$P(\theta \mid x) \propto P(x \mid \theta) \times P(\theta)$$

and we require that prior information be expressed in a suitable form.
Traditionally, it is assumed that there is no prior information about
$\theta$. In other words, all possible calibrated dates (in the range of the
calibration curve) are considered equally likely to be the true date of the
sample under investigation. There are, however, situations in which really
useful prior information does exist. Suppose, for example, that we know from
historical evidence that $\theta$ must have occurred between two calendar
dates, $\psi_1$ and $\psi_2$ BP. Then,

$$\psi_1 > \theta > \psi_2$$

and, if no date in the range is more likely than any other, a priori all the
dates in the range are equally likely. All dates outside the range have zero
probability.

Another common type of prior information is that one event must be later
(or earlier) than another. For example, if $\theta_1$ BP and $\theta_2$ BP
are the unknown calendar dates of two events, we might have stratigraphic
information which tells us that

$$\theta_1 > \theta_2.$$

Now, let $x_1 \pm \sigma_1$ and $x_2 \pm \sigma_2$ be two radiocarbon
determinations associated with the two events we wish to date. The posterior
distribution for $\theta_1$ and $\theta_2$ is given by

$$P(\theta_1, \theta_2 \mid x_1, x_2) \propto P(x_1, x_2 \mid \theta_1, \theta_2) \times P(\theta_1, \theta_2).$$

In this situation, we evaluate

$$P(\theta_1, \theta_2 \mid x_1, x_2)$$

over a grid of $\theta_1$ and $\theta_2$ values, noting that it is zero
unless $\theta_1 > \theta_2$.
Of course, the situation just described is very simple because it involves
just two events. More realistic and commonly occurring scenarios will
comprise many events with possibly quite complex interrelationships. A real
example of such a problem follows.
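
The grid evaluation for two events can be sketched as follows. This Python
fragment is a toy illustration only: it replaces the internationally agreed
calibration curve by a hypothetical identity curve, assumes independent
normal errors for the two determinations, and uses an invented second
determination of 3580 ± 60 next to the example value 3600 ± 60.

```python
import math


def toy_curve(theta):
    """Hypothetical stand-in for the calibration curve mu(theta); not the real curve."""
    return theta


def normal_density(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def joint_posterior(x1, sd1, x2, sd2, grid):
    """Posterior of (theta1, theta2) on a grid with uniform prior on theta1 > theta2."""
    unnormalised = {}
    for t1 in grid:
        for t2 in grid:
            if t1 > t2:  # stratigraphic constraint (dates in years BP)
                unnormalised[(t1, t2)] = (
                    normal_density(x1, toy_curve(t1), sd1)
                    * normal_density(x2, toy_curve(t2), sd2)
                )
            else:
                unnormalised[(t1, t2)] = 0.0  # excluded by the prior
    total = sum(unnormalised.values())
    return {pair: value / total for pair, value in unnormalised.items()}


theta_grid = list(range(3300, 3901, 10))
posterior = joint_posterior(3600, 60, 3580, 60, theta_grid)

# Posterior means of the two calendar dates.
mean_theta1 = sum(t1 * p for (t1, _), p in posterior.items())
mean_theta2 = sum(t2 * p for (_, t2), p in posterior.items())
```

Because the two determinations overlap strongly, the ordering constraint
pulls the two posterior means apart compared with calibrating each
determination on its own.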



4.1 St. Veit-Klinglberg, Austria 

Typically, when undertaking radiocarbon dating, archaeologists are not sim- 
ply interested in the calendar dates of the samples submitted for dating, but 
would also like to be able to tackle other chronological issues too. Examples 
include, ‘what is the chronological sequence of these events?’, ‘given that 
stratigraphy provides a chronological sequence for some of the events, can 
other events be placed within it?’ and ‘if we have chronologies of events from 
two different sites, can we relate the dates of events on one site to those on 
the other?’. 

Buck et al. (1994b) describe the application of Bayesian methods to the
analysis of fifteen radiocarbon determinations from the early Bronze Age
settlement of St. Veit-Klinglberg, Austria. During the excavation
archaeologists collected samples for radiocarbon dating. Using stratigraphic
evidence, it was possible to order the calendar dates of ten of the events.
Let $\theta_i$ denote the calendar date of context $i$ (where $i$ is the
context number allocated at the time of excavation); then the archaeological
information can be expressed in the form of the following inequalities.

$$\theta_{758} > \theta_{814} > \theta_{1235} > \theta_{358} > \theta_{813} > \theta_{1210},$$
$$\theta_{493} > \theta_{358},$$
$$\theta_{925} > \theta_{923} > \theta_{358},$$
$$\theta_{1168} > \theta_{358}.$$



With no other archaeological knowledge, the $\theta_i$s are assumed to have a
uniform prior subject to these constraints. Thus the posterior of $\theta$ is
given by

$$p(\theta \mid x) \propto \prod_{i} \frac{1}{\sigma_i} \exp\left\{ -\frac{(x_i - \mu(\theta_i))^2}{2\sigma_i^2} \right\}$$
for values of $\theta$ satisfying the above constraints.

In Figure 1 we illustrate the nature of the results obtained in
investigations of this sort by providing plots of the posterior probability
of $\theta_{1235}$. In Figure 1(a) the stratigraphic information is included
in the calibration process and in Figure 1(b) it is ignored. We see clearly
here the marked effect of the stratigraphic information and note that when it
is included the calibrated date range is 200 years shorter than when it is
ignored.

Figure 1: Calibrated date distributions for context 1235 at St.
Veit-Klinglberg (a) when stratigraphic information is accounted for, and (b)
when it is ignored.

Computing the posterior probability distributions for the calendar date of
each event when the relative dates of all the others are to be considered is
not trivial. We need to work with a large number of parameters and the
complex inequalities that represent the relative chronological information.
Consequently, in obtaining the information represented in Figure 1, recent
developments in numerical methods (especially Markov Chain Monte Carlo
simulation) proved particularly powerful. Details of such methods are given
in a number of the texts in the bibliography, but see in particular Litton
and Buck (1996).

For those particularly interested in the archaeological implications of such
work, Buck et al. (1994b) address questions such as (a) the dates of the ten
events and (b) the time period between certain specific events. In addition,
they are able to use radiocarbon dating to place into the chronology several
deposits which cannot be stratigraphically related to the main sequence. As
a result, archaeologists now have a firm indication of the occupation periods
of the site and how localised, stratigraphic chronologies relate to one
another.

5 The Benefits of a Bayesian Approach to
Radiocarbon Dating

By adopting a Bayesian framework for radiocarbon calibration, it has been
possible to rigorously address a range of questions that were previously hard
to tackle. These include:

• what are the calibrated dates for the start and end of each phase?

• how long did each phase last?

• how long was the hiatus between phases?

• where do the calibrated dates for other unplaced events fit?

Indeed, the Bayesian radiocarbon calibration framework has now developed
to such an extent that most types of absolute and relative a priori dating
evidence can be included in the calibration process. For details of these and
further approaches to Bayesian radiocarbon calibration we refer the reader
to Buck et al. (1991), Buck et al. (1992), Buck et al. (1994a), Buck et al.
(1994b), Christen (1994a), Christen (1994b), Litton and Buck (1996), Buck
et al. (1996) and Buck and Christen (1998).

6 Some Other Applications

Alongside development of the Bayesian framework for radiocarbon calibration,
a number of other fields of archaeology have benefited from the application
of Bayes’ theorem and modern computing techniques. These include the
following.

• Investigations into the length of the megalithic yard were undertaken
by Freeman (1976).

• Kadane and Hastorf (1988) used palaeoethnobotanical remains to help
identify the nature of archaeological activity undertaken at sites in
Peru.

• Cavanagh et al. (1988) used Bayesian image processing for noisy soil
phosphate data to distinguish areas of archaeological activity from
those that showed no signs of human occupation.

• Buck et al. (1993) investigated the structure and stability of prehistoric
corbelled domes, identifying trends in architectural technique.

• Buck and Litton (1996) adopted a Bayesian approach to studies of
ceramic provenance allowing archaeologists to include prior information
about what constitutes a provenance where it is available.

• Buck et al. (1996) illustrate possibilities for the use of Bayesian
approaches to seriation and dendrochronology.

• Lucy et al. (1996) outline a Bayesian approach to human age
determination on the basis of dental observations.

7 Some Implementation Issues

Many archaeological problems are statistically non-standard. Off-the-peg 
statistical solutions are rarely adequate and consequently specific and spe- 
cialist models are needed. Close collaboration is necessary between a number 
of specialists if useful models are to be built. 

Many archaeologists and statisticians remain to be convinced that explicit 
inclusion of prior information in the interpretative process is either possible 
or desirable. Even when agreement can be reached, some types of prior in- 
formation are still very hard to specify in a suitable statistical form. For 
those of us convinced of the utility of the Bayesian approach to archaeolog- 
ical data interpretation there is still much work to do in the area of prior 
elicitation and definition. 

Posteriors are established by solving Bayes’ theorem, but the simple form 
of the theorem given above hides a great deal of complexity. Apart from a 
few, simple, special cases, evaluating posteriors requires the use of advanced 
numerical analysis techniques and fast computers. There are now a number 
of elegant algorithms available and these have been widely adopted for the 
work outlined above. A useful summary of what is needed for a wide range 
of example applications is given in Gilks et al (1996). 
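
As a minimal sketch of the kind of simulation algorithm referred to here,
the following Python fragment implements a simple random-walk Metropolis
sampler for calendar dates subject to an ordering constraint. It is not the
algorithm used in the studies cited above: the calibration curve is replaced
by a hypothetical identity curve, the three determinations are invented, and
the fully ordered chain of events is a simplifying assumption.

```python
import math
import random


def log_posterior(thetas, xs, sds):
    """Log posterior up to a constant; -inf where theta_1 > theta_2 > ... fails."""
    if any(a <= b for a, b in zip(thetas, thetas[1:])):
        return -math.inf
    # Toy identity curve mu(theta) = theta and uniform prior inside the constraints.
    return sum(-0.5 * ((x - t) / s) ** 2 for t, x, s in zip(thetas, xs, sds))


def metropolis(xs, sds, start, steps, step_sd, rng):
    """Random-walk Metropolis sampler updating one date per iteration."""
    current = list(start)
    current_lp = log_posterior(current, xs, sds)
    if current_lp == -math.inf:
        raise ValueError("the starting values must satisfy the ordering constraint")
    samples = []
    for _ in range(steps):
        proposal = list(current)
        k = rng.randrange(len(proposal))
        proposal[k] += rng.gauss(0.0, step_sd)
        proposal_lp = log_posterior(proposal, xs, sds)
        # 1 - random() lies in (0, 1], so the logarithm is always defined.
        if math.log(1.0 - rng.random()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        samples.append(tuple(current))
    return samples


rng = random.Random(0)
samples = metropolis(
    xs=[3600, 3580, 3500],   # hypothetical determinations
    sds=[60, 60, 50],        # hypothetical laboratory errors
    start=[3650, 3580, 3500],
    steps=5000,
    step_sd=40.0,
    rng=rng,
)
```

Proposals that violate the ordering receive zero posterior probability and
are therefore always rejected, so every stored sample respects the
constraint. In practice one would discard an initial part of the chain and
check convergence before summarising the samples.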

8 Looking Ahead 

The application of Bayesian statistics to archaeology is still relatively new. 
There have been some important successful applications and much work is 
still in progress. Some of the techniques that were, until recently, cutting 
edge are now becoming routine. Bayesian radiocarbon calibration, for ex- 
ample, is now quite widely used. A Windows-based package called OxCal 
is already available for undertaking some of the more basic types of calibra- 
tion outlined above (Ramsey, 1995) and a World-Wide Web based facility 
with access to substantial computer power is currently under development 
(http://bcal.cf.ac.uk/). 

Greater understanding of the complexities of each other’s disciplines is still 
required by both statisticians and archaeologists if collaboration is to be 
most fruitful. In addition, statisticians and archaeologists should continue 
to work together and share in new developments in each other’s fields. 

Abstract: Information retrieval on environmental issues is becoming
increasingly important, as can be seen from the claims of Agenda 21 and the
activities of the European Topic Centre on Catalogue of Data Sources
(ETC/CDS). Since thesauri are often used for indexing and retrieval, they
play an essential role in information systems. One of the most significant
examples in Europe is the Environmental Thesaurus (Umwelt-Thesaurus) of the
Umweltbundesamt. We have examined the thesaurus structure and its interplay
with the environmental database ULIDAT. The results are presented together
with suggestions for improvement.



1 Introduction 

“Access to information relevant to environment and development” for a
broad public is a claim of Agenda 21 for the achievement of sustainable
development. One of the crucial points in searching for information on
a certain topic is to find the relevant information. This holds in
particular for non-experts who have no training in retrieval methods or
languages and may not have a precise notion of what they are looking for.
Full text search is often offered as a solution, but its results frequently
turn out to be unsatisfactory.

In Germany one of the tasks of the Umweltbundesamt (Federal Environ- 
mental Agency) is to provide environmental information to the public. The 
databases ULIDAT (for environmental literature), UFORDAT (for environ- 
mental research and development projects) and URDB (for environmen- 
tal law) can be accessed via the CD-ROM Umwelt-CD and the database 
host STN (Scientific Technical Network). The Environmental Thesaurus 
(Umwelt-Thesaurus) is used for indexing and retrieval in the databases of 
the Umweltbundesamt. 

In the European Union the Directive of Freedom of Access to Environmen- 
tal Information, adopted 1990, has led to legislation in its member states 
(Kommission der Europaischen Gemeinschaft (1990)). The European Envi- 
ronment Agency (EEA) has been established to provide environmental in- 
formation to the European Community, its member states and to the general 
public. The EEA develops a Catalogue of Data Sources (CDS) of institutions 
with environmental data and uses the General Multilingual Environmental 
Thesaurus (GEMET) for indexing in the CDS. GEMET has been compiled
by merging a selection of terms of the Environmental Thesaurus and several 
other thesauri from other countries. Hence the Environmental Thesaurus 
has an important role in finding information on the environment in Germany 
and in Europe. In our investigation we have examined the Environmental 
Thesaurus and its interplay with the database on environmental literature 
ULIDAT. 



2 The Environmental Thesaurus 

The Environmental Thesaurus provides controlled terms for retrieval in
several databases offered by the Umweltbundesamt. The thesaurus comprises
22094 terms that are divided into two classes: there are 8563 controlled
terms or descriptors and 13531 nondescriptors. Only the descriptors are used
for retrieval in a database. For users who are not familiar with the
thesaurus the so-called nondescriptors are needed to find the appropriate
descriptor or combination of descriptors within the thesaurus. The
descriptors are ordered in a polyhierarchic broader term-narrower
term-relation. We may assume that this relation is reflexive, transitive, and
anti-symmetric, hence that it is an order relation. At the most general level
of the hierarchical structure there are 501 descriptors. The thesaurus is
given by a table where for each descriptor links to its narrower terms and
its broader terms are given. Since the broader term-narrower term-relation is
transitive, it is sufficient to link only the neighbouring broader and
narrower terms.

We have found two kinds of order inconsistencies. First, there are
descriptors that have a link to a narrower term while that narrower term
lacks the link back to the corresponding broader term. For example, the
descriptor “Betreiberpflicht” is listed as a narrower term of
“Umweltschutzauflage”, but “Umweltschutzauflage” is not listed as a broader
term of “Betreiberpflicht”. The same happens for the narrower term
“Natureigenrecht” of the broader term “Mensch-Natur-Verhaeltnis”. Secondly,
the Environmental Thesaurus has superfluous links. An example is shown in
Fig. 1. Since the broader term-narrower term-relation is transitive, the link
between “Abgas” and “Kfz-Abgas” is superfluous because the two terms are
already related via “Verbrennungsabgas”. This phenomenon occurs 633 times
within the Environmental Thesaurus. Although they do not lead to any
contradictions, superfluous links make the thesaurus structure and its
modifications more difficult to understand.
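
Both kinds of inconsistency can be detected mechanically. The following
Python sketch is an illustration under the assumption that the thesaurus
table is available as two dictionaries mapping each descriptor to its linked
narrower and broader terms; this data layout and the function names are
hypothetical and do not describe the actual format of the Environmental
Thesaurus.

```python
def missing_back_links(narrower, broader):
    """Pairs (b, n) where n is linked as narrower term of b but b is not
    linked as broader term of n."""
    return {
        (b, n)
        for b, kids in narrower.items()
        for n in kids
        if b not in broader.get(n, set())
    }


def reachable(narrower, start):
    """All terms below `start` that can be reached via narrower-term links."""
    seen, stack = set(), [start]
    while stack:
        term = stack.pop()
        for kid in narrower.get(term, ()):
            if kid not in seen:
                seen.add(kid)
                stack.append(kid)
    return seen


def superfluous_links(narrower):
    """Direct links (a, c) for which c is also reachable via another narrower term of a."""
    result = set()
    for a, kids in narrower.items():
        for c in kids:
            if any(c in reachable(narrower, d) for d in kids if d != c):
                result.add((a, c))
    return result


# The example of Fig. 1 and the first inconsistency mentioned in the text.
narrower = {
    "Abgas": {"Verbrennungsabgas", "Kfz-Abgas"},
    "Verbrennungsabgas": {"Kfz-Abgas"},
    "Umweltschutzauflage": {"Betreiberpflicht"},
}
broader = {
    "Verbrennungsabgas": {"Abgas"},
    "Kfz-Abgas": {"Abgas", "Verbrennungsabgas"},
    "Betreiberpflicht": set(),  # the link back to the broader term is missing
}

redundant = superfluous_links(narrower)
unmatched = missing_back_links(narrower, broader)
```

Removing the links reported as superfluous leaves only the links between
neighbouring terms, which is sufficient because the relation is transitive.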

The level of detail in which the thesaurus has been elaborated is varying 
throughout the thesaurus. The longest chain in the thesaurus hierarchy 
from one of the broadest terms to one of the narrowest terms consists of 13 
terms whereas the shortest chain consists of only 2 terms. This is caused 
by the fact that the thesaurus has not been designed as a whole but has 
been extended by terms that were needed for indexing documents (Batschi 
(1994)). 

Figure 1: Superfluous link between descriptors



3 Retrieval from the Database ULIDAT 

The purpose of an information retrieval system like the one on the
Umwelt-CD is that documents can be identified and located in response to
various types of user demands. This can be achieved with a detailed and
accurate knowledge of the information needs of the users to be served (see
Lancaster (1986)). However, in the case of the Umwelt-CD and ULIDAT it is not
exactly known who the potential users are and what their needs are. Hence, in
our investigation we assumed that the user is a non-expert both in
information science and in the topic he is searching for information on.

The access to ULIDAT provided by STN (Scientific Technical Network)
requires sufficient knowledge of the retrieval language Messenger. Messenger
is not self-explanatory and needs training to be used effectively. The use of
Messenger to take advantage of the Environmental Thesaurus for searching
requires knowledge of the thesaurus and its hierarchical structure. This
knowledge cannot be expected of non-experts. This is the reason why
we did not investigate the retrieval from ULIDAT provided by STN any
further.

In the database ULIDAT on the Umwelt-CD, documents can be searched by
author, title, publisher, year of publication, abstract (if available) and
catchwords. The search using catchwords can be supported by the
Environmental Thesaurus or by the Geo-Thesaurus for geographic units. Two
types of catchwords have been assigned to each document to describe its
content: approximately 10 to 20 descriptors from the thesauri and
approximately 3 to 10 free catchwords. By choosing free catchwords for a
document the indexer has the possibility to describe the content more
precisely than the thesaurus allows. For the extension of the
Environmental Thesaurus, free catchwords that occur frequently are
considered as candidates for new descriptors (Batschi (1994)). The free
catchwords may cause confusion in a search, since a query cannot
distinguish between descriptors and free catchwords. If a user searches
for a catchword (or a combination of catchwords), the system yields all
documents that have been assigned this catchword, regardless of whether it
was found among the descriptors or among the free catchwords. Let us give
an example where this may cause confusion: The term “saurer Regen” is used
as a free catchword, but the applicable descriptor for this topic is
“saurer Niederschlag”. A user who does not know the suitable descriptor
will use the more common term “saurer Regen”. He will then miss all
documents that are indexed only with the descriptor “saurer Niederschlag”,
and he cannot know how low the recall (the proportion of relevant
documents retrieved) is in this case. Using only the free catchword in his
search he will lose documents he might be interested in. If the term
“saurer Regen” had not been used for indexing the documents, the inquiry
would not lead to any matches at all. Since it is very unlikely that there
are no documents on this topic in ULIDAT, the user might then look into
the thesaurus to find a suitable descriptor. Hence it is necessary to
enable a user to distinguish between descriptors and free catchwords in a
search, as described in Lancaster (1986) for hybrid systems that operate
on a combination of controlled and free terms (cf. Krause (1998)).
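
The loss of recall described above can be made concrete with a small sketch. The following Python fragment is only an illustration under stated assumptions: the four documents and their catchword assignments are invented, and `search` is a hypothetical stand-in for the retrieval on the Umwelt-CD, not its actual implementation.

```python
# Hypothetical mini-index: every document carries thesaurus descriptors and
# free catchwords. Document ids and assignments are invented for illustration.
INDEX = {
    "doc1": {"descriptors": {"saurer Niederschlag"}, "free": {"saurer Regen"}},
    "doc2": {"descriptors": {"saurer Niederschlag"}, "free": set()},
    "doc3": {"descriptors": {"saurer Niederschlag"}, "free": set()},
    "doc4": {"descriptors": {"Waldschaden"}, "free": {"saurer Regen"}},
}


def search(term, fields=("descriptors", "free")):
    """Return the ids of all documents that carry `term` in one of `fields`."""
    return {doc for doc, entry in INDEX.items()
            if any(term in entry[field] for field in fields)}


def recall(retrieved, relevant):
    """Proportion of the relevant documents that have been retrieved."""
    return len(retrieved & relevant) / len(relevant)


# Assumption: every document indexed with the descriptor is relevant.
relevant = search("saurer Niederschlag", fields=("descriptors",))

# A query that cannot distinguish descriptors from free catchwords.
mixed = search("saurer Regen")

# A query restricted to descriptors reveals that the term is no descriptor.
as_descriptor = search("saurer Regen", fields=("descriptors",))

print(sorted(mixed), recall(mixed, relevant), sorted(as_descriptor))
```

In this toy index the undifferentiated query returns two documents and looks successful, although only one of the three relevant documents is among them. The empty result of the descriptor-only query is the hint that would send the user to the thesaurus.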

The documents of ULIDAT are indexed according to the Principle of
Specificity (Lancaster (1991, p. 26)): “...a topic should be indexed under
the most specific term that entirely covers it.” Hence the result of a
search with a narrower term is not a subset of the result of a search with
one of its broader terms. The advantage of the Principle of Specificity is
that it makes it possible to distinguish general documents from specific
ones. As requested in Lancaster (1991, p. 27), the Umwelt-CD allows a
search on a descriptor and all of its narrower terms. It is even possible
to select which of the narrower terms should be taken into account. Figure
2 shows a screen dump of the Umwelt-CD where three narrower terms of
“Gewerbelaerm” have been selected.
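
A search on a descriptor together with its narrower terms can be sketched as follows. This is a minimal illustration and not the implementation on the Umwelt-CD: apart from “Gewerbelaerm”, all descriptor names, the hierarchy and the document assignments are hypothetical.

```python
# Hypothetical fragment of a thesaurus hierarchy (broader -> narrower terms).
NARROWER = {
    "Laerm": ["Gewerbelaerm", "Verkehrslaerm"],
    "Gewerbelaerm": ["Baulaerm", "Industrielaerm"],
}

# Hypothetical postings. Under the Principle of Specificity a document is
# listed only under the most specific descriptor that covers it.
POSTINGS = {
    "Laerm": {"d5"},
    "Verkehrslaerm": {"d6"},
    "Gewerbelaerm": {"d1"},
    "Baulaerm": {"d2"},
    "Industrielaerm": {"d3", "d4"},
}


def with_narrower_terms(descriptor):
    """Collect a descriptor and all terms below it in the hierarchy."""
    seen, stack = set(), [descriptor]
    while stack:
        term = stack.pop()
        if term not in seen:  # a term may be reachable via several paths
            seen.add(term)
            stack.extend(NARROWER.get(term, []))
    return seen


def search_expanded(descriptor, selected=None):
    """Search a descriptor plus all (or only the selected) narrower terms."""
    terms = with_narrower_terms(descriptor)
    if selected is not None:
        terms = {descriptor} | (terms & set(selected))
    return set().union(*(POSTINGS.get(term, set()) for term in terms))


print(sorted(search_expanded("Gewerbelaerm")))
print(sorted(search_expanded("Gewerbelaerm", selected=["Baulaerm"])))
```

The plain search on “Baulaerm” in this sketch returns a document that the plain search on “Gewerbelaerm” does not, which is exactly why the expanded search is needed under the Principle of Specificity.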



Figure 2: Selection of narrower terms of a descriptor



4 An Alternative Approach

In this section, we present an alternative method of information retrieval
using Conceptual Knowledge Processing (Wille, Zickwolff (1994), Stumme,
Wille (1998), Wille (1999)). The main idea of our approach is to show
the user a conceptual surrounding of the descriptors used for searching
(cf. Godert and Lepsky (1997)). Such a conceptual surrounding can be
visualized by a labelled line diagram like the one in Fig. 3. The computer
program TOSCANA, which uses methods of Conceptual Knowledge Processing,
allows online interaction with large databases and uses labelled line
diagrams to show networks of conceptual relationships (Vogt, Wille (1995),
Kollewe et al. (1994)). A TOSCANA system implemented in a library in
Darmstadt demonstrates its usefulness for document retrieval (Rock, Wille
(1998)).

Let us demonstrate by an example how TOSCANA can interact with ULIDAT.
Conceptual surroundings of descriptors of the Environmental Thesaurus
can be visualized by a labelled line diagram like the one in Fig. 3. The
assignment of the descriptors to the documents can be read from the
diagram. The numbers in the labels indicate how many documents from ULIDAT
have been assigned a certain combination of descriptors. By clicking on the
numbers on the screen the user can display the titles and the authors of
the documents. This has been done for one label in Fig. 3. Hence this
label now shows the title (“Geplanter Ausbau des Mekong”) and the author
(“Anonym”) of a document in ULIDAT. Let us now explain how to read
the line diagram in Fig. 3. A document has been assigned a descriptor if
there is an ascending path of lines from the black node to which the
document is attached to the black node to which the descriptor is
attached. For example, the document with the title “Geplanter Ausbau des
Mekong” has been assigned the descriptors “Talsperre” and “Staustufe”.

Figure 3: Line diagram derived from the database about environmental
literature ULIDAT

To obtain all documents that have been assigned a conjunction of
descriptors, we proceed as follows. From the nodes of the descriptors we
follow descending paths of lines to the highest node that lies below all of
the descriptor nodes. All documents attached to this node and to the nodes
below it have been assigned the conjunction of descriptors. For example,
the conjunction of “Stausee” and “Talsperre” leads to the node labelled
with “4”. Hence, there are four documents that have been assigned
“Stausee” and “Talsperre”. To obtain
the disjunction of “Stausee” and “Talsperre”, we have to look at all the
nodes that can be reached by descending paths from the node of “Stausee”
or from the node of “Talsperre”. Hence, there are
103 (= 57 + 24 + 4 + 9 + 7 + 1 + 1) documents that have been assigned
“Stausee” or “Talsperre”. A user does not need to compute such sums
himself, because TOSCANA offers two alternative numberings for the line
diagram shown on the screen: the exact numbers as in Fig. 3, and the
numbers shown in Fig. 4, which have been added up along descending paths.
In Fig. 3, the node with the label “Talsperre” has the number 57 attached
to it; the two nodes below it have the numbers 4 and 9, respectively. In
Fig. 4, the node with the label “Talsperre” has the number 71 attached to
it, which is the sum of the numbers of all nodes that can be reached from
it by descending paths, including the node itself. Adding up the numbers
along descending paths makes it possible to locate all documents that have
been assigned a certain descriptor or one of its narrower terms. We can
read from Fig. 4 that the descriptor “Staugewaesser” or one of its narrower
terms has been assigned to 136 documents. The descriptor “Stausee” or one
of its narrower terms has been assigned to 36 documents. The descriptor
“Talsperre” or one of its narrower terms has been assigned to 71 documents.
Now we can again ask: How many documents have been assigned “Stausee” or
“Talsperre”? The answer is 103 (= 36 + 71 - 4), since there are 4
documents that have been assigned “Stausee” and “Talsperre”.
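
The counting argument for conjunction and disjunction can be checked with plain set operations. The sketch below is an illustration only: the document ids are invented, and only the set sizes (36, 71 and 4) are taken from the numbers quoted for Fig. 4.

```python
# Invented document ids; only the sizes follow the numbers quoted in the text.
both = {f"b{i}" for i in range(4)}               # "Stausee" and "Talsperre"
stausee = both | {f"s{i}" for i in range(32)}     # 36 documents in total
talsperre = both | {f"t{i}" for i in range(67)}   # 71 documents in total

conjunction = stausee & talsperre   # documents with both descriptors
disjunction = stausee | talsperre   # documents with at least one of them

# Documents in the overlap are counted once, hence 36 + 71 - 4.
by_formula = len(stausee) + len(talsperre) - len(conjunction)

print(len(conjunction), len(disjunction), by_formula)
```

The subtraction of the overlap is the reason why the added-up numbers of Fig. 4 cannot simply be summed for a disjunction.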




Figure 4: Line diagram in Fig. 3 with added up numbers along descending 
paths 

Now let us turn to the advantages of retrieval with TOSCANA. The
information visualized by a labelled line diagram can certainly be obtained
with the search mechanism on the Umwelt-CD as well. But this requires
several queries, and the results of these queries cannot be displayed on a
single screen. In a search where a user does not know exactly what he is
looking for, a labelled line diagram unfolds the conceptual structure of
the topic he is interested in. Labelled line diagrams might be helpful for
non-experts in particular. Our experience with TOSCANA systems in various
applications has shown that new users learn to read labelled line diagrams
very quickly. In our conception of a TOSCANA system with the Environmental
Thesaurus, a user can combine several descriptors that do not have similar
or overlapping conceptual surroundings in so-called nested line diagrams
(see Vogt, Wille (1995) and Kollewe et al. (1994)). Hence TOSCANA allows
rich interactive browsing through the database ULIDAT. TOSCANA systems
that interact with thesauri have not yet been implemented, but they are
outlined in Skorsky (1997) and Groh et al. (1998).



Abstract: A daily prediction of ground-level ozone is necessary in order to
inform the population and to allow measures to be taken to reduce the ozone
concentration when the one-hour ozone value exceeds 180 µg/m³.

After presenting and comparing the methods applied to ozone prognosis, we
develop a fuzzy approach in which the process of rule formation is supported
by a special neural network.

Using ground measurements of ozone and nitrogen oxides and the meteorological
series of temperature, humidity, cloud cover and wind from 3 Saxon measuring
stations, optimal knowledge bases are generated and tested with the developed
Fuzzy-Multilayer Perceptron.



1 Introduction 

The composition of the earth’s atmosphere is characterized by a dynamic
balance, which is disturbed by human activities. Fundamental disturbances
are caused by industrial and automobile exhaust. In particular, the
emissions of nitrogen oxides and volatile organic compounds can lead to a
significant increase of the ground-level ozone concentration.

Ozone, however, is harmful to humans, animals and plants. The 22nd
regulation concerning the enforcement of the federal emission protection
law therefore stipulates that, when the one-hour ozone concentration
exceeds 180 µg/m³, the population is informed and motor vehicles are not
used, as far as possible. In order to allow preventive measures to be taken
to reduce the ozone concentration when such an exceedance is expected, a
daily prediction of the ground-level ozone maximum is necessary.

The complex physico-chemical processes which lead to the formation of
ground-level ozone are only partly understood, or the determining variables
of these processes can only be observed at disproportionately great
expense. Therefore, each model of ozone formation is only an approximation
of reality. The object of this paper is to test selected methods for their
suitability for the prediction of the ozone concentration.

As input variables $\xi_1(t), \ldots, \xi_n(t)$ for a model, the ground
measurements of ozone and nitrogen oxides and the meteorological series of
temperature, humidity, cloud cover and wind from April to September of the
years 1992-1995 from 3 Saxon measuring stations are available.

The output variable $\eta(t+1)$ describes the prediction of the ozone maximum
for the following day.

There are several kinds of modelling, all of which aim to find a functional
relationship for the observed data series.



2 Applied Methods of the Ozone Prognosis 

2.1 Statistical / Logic Approaches 

At present the Federal Ecological Office uses regression models with a
time-series-analytical element (ARMAX models):

$$O_{3\max}(t+1) = k + a \cdot O_{3\max}(t) + b \cdot T_{\max}(t) + c \cdot T_{\max}(t+1) \qquad (1)$$

where

$O_{3\max}(t+1)$   is the prediction of the ozone maximum for the following day
$O_{3\max}(t)$     is the ozone maximum on the same day
$T_{\max}(t)$      is the temperature maximum on the same day
$T_{\max}(t+1)$    is the prediction of the temperature maximum for the following day
$a, b, c, k$       are coefficients.
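
To illustrate how a fitted model of the form (1) would be applied, consider the following sketch. The coefficient values are purely hypothetical placeholders chosen for the example (the fitted values are not given here), and the function name is our own.

```python
THRESHOLD = 180.0  # µg/m³, the information threshold quoted in the text


def predict_o3_max(o3_max_today, t_max_today, t_max_forecast, k, a, b, c):
    """Evaluate regression model (1) for the ozone maximum of the next day."""
    return k + a * o3_max_today + b * t_max_today + c * t_max_forecast


# Hypothetical coefficients, NOT estimates from the Saxon data series.
COEFFICIENTS = {"k": 10.0, "a": 0.5, "b": 1.0, "c": 2.0}

# Today's ozone maximum (µg/m³), today's and tomorrow's temperature maximum (°C).
forecast = predict_o3_max(160.0, 28.0, 33.0, **COEFFICIENTS)
warn = forecast > THRESHOLD

print(forecast, warn)
```

In practice the coefficients would be estimated from the historical series, for instance by least squares; the sketch only shows how the forecast temperature enters the prediction alongside the measurements of the current day.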



In Reimer et al. (1996) results of the regression analysis for data series from 
measuring stations in the Saxon region are represented. 

Instead of an analytical description of the problem, a description by sets
(clusters) can be used. The aim is then to find suitable classifiers with
which the observed data can be optimally arranged.

Apart from crisp classification, a non-crisp definition with overlapping
sets leads to rule-based systems with a fuzzy approach.



2.2 Deterministic Approaches

Deterministic models describe the physical and chemical processes in the
troposphere and attempt to capture the system behavior and the spatial
distribution of the state variables with the help of mathematical
equations.

In Flemming (1996) the application of the photo-chemical propagation model
REM-3 to short-term prognosis is presented.

Because of the necessary physical and chemical simplifications and the
mathematical approximations it is not possible, in principle, to develop a
model which reduces the difference between the measurements and the model
predictions to “white noise”.

Even with highly complex modelling, there remains a part which must be
interpreted stochastically.

2.3 Neural Networks

The application of neural nets to ozone prognosis is regarded as
advantageous by some authors (Becher and Schmidt (1994), Hartmann (1993)
and Van Praagh (1995)), because they require no explicit mathematical model
for the reproduction of reality.

Neural nets are able to derive an implicit model and to learn the complex
relationships from data.

In Reimer et al. (1996) neural nets are not used because their results are
difficult to interpret.



2.4 Fuzzy Approaches 

Reimer et al. (1996) demonstrate the development of a fuzzy system for
the ozone prognosis.

The authors calculate the fuzzy sets of the linguistic variables with a cluster 
method, which is based on the Euclidean metric. 

The fuzzy rules of the prognosis systems are identified with the aid of a 
decision network and of experts (Figure 1). 




[Figure 1, recoverable labels: Fuzzy rules; Output variables: Y(t+1) (real
value), η(t+1) (prognosis value), e.g. O3(t+1)]

Figure 1: Problem solution with Fuzzy approach.



The fuzzy rules are defined as follows:

$R_1$: IF $\xi_1$ is $A_{i,\xi_1,1}$ AND ... AND $\xi_n$ is $A_{i,\xi_n,1}$ THEN $\eta$ is $B_{j,1}$
  ...
$R_k$: IF $\xi_1$ is $A_{i,\xi_1,k}$ AND ... AND $\xi_n$ is $A_{i,\xi_n,k}$ THEN $\eta$ is $B_{j,k}$

By $A_{i,\xi_l,k}$ we denote the term $A_i$ of the linguistic variable $\xi_l$
used in the rule $R_k$ and analogously by $B_{j,k}$ the term $B_j$ of the
variable $\eta$ in the rule $R_k$ ($i \in \{1, \ldots, s\}$;
$j \in \{1, \ldots, t\}$; $s, t$ - number of terms/clusters).

However, finding the functional relationship in the form of fuzzy rules is
quite difficult even for experts. Therefore, in this paper we propose an
approach that supports the process of rule formation by special neural
networks.







3 The Neuro-Fuzzy Model

Data mining aims to extract useful knowledge from large, complex data sets.
The main focus lies on the identification of patterns from which rules are
derived. Standard cluster methods such as c-means are generally used for
this purpose.

Fuzzy set technology can contribute considerably to increasing the
effectiveness of search and analysis procedures in high-dimensional spaces.
Firstly, fuzzy sets support a focused search for the interesting subspaces
under the assumption of a relevant clustering, e.g. ozone values
> 180 µg/m³. Secondly, the functional relationships are again obtained in a
simple qualitative format and are thus easy to interpret. The following
approach combines the modelling of fuzzy rules with the structure and the
learning abilities of neural networks.

3.1 Architecture 

The basis for this model is the transformation of a fuzzy system with a rule
base into a neural structure, see also LIN and LEE (1994, 1996).

Its structure corresponds to a five-layer feed-forward neural network, in
which only the neurons of successive layers are connected.




Layer 5 ($N_5$): output linguistic neurons
Layer 4 ($N_4$): output term neurons
Layer 3 ($N_3$): rule neurons
Layer 2 ($N_2$): input term neurons
Layer 1 ($N_1$): input linguistic neurons

Figure 2: Structure of a neuro-fuzzy model with two input variables.

The neurons of the input layer represent the linguistic input variables
$\xi_1, \ldots, \xi_n$ and the neuron of the output layer the linguistic
output variable $\eta$.

The neurons of the layers 2 and 4 are called term neurons and reproduce the
membership functions $\mu$ and $\nu$ of the terms $A_i$ and $B_j$,
respectively.

A term neuron performs a simple membership function (e.g., a bell-shaped
function or a triangular function).

The neurons of layer 3 represent the fuzzy IF-THEN rules (Figure 2).
All rule neurons together form the knowledge base of the fuzzy system. The
antecedent of a rule is determined by the connections between the input
term neurons and the respective rule neuron.

The connection between the rule neuron and the output term neuron defines
the consequent of the rule.

A neuro-fuzzy system has connections that carry more than one weight $W$.
These connections exist between the neurons of the input or output layer
and the corresponding term neurons. The connection from an input neuron
$n'$ to an input term neuron $n$ is labeled with the parameters $m_{n',n}$
and $\sigma_{n',n}$, where $m$ is the center (mean) and $\sigma$ the width
(variance) of the bell-shaped function representing the term $A_{i,n}$ of
the corresponding linguistic variable.

An analogous labeling applies to the connections between the output neuron
and the output term neurons.

The net input $net_n$ and the activation $a_n$ are calculated for each
neuron $n$ ($o$ - output function):

(a) for input neurons $n \in N_1$:

$$a_n = net_n = ex_n \quad \text{(external input)} \qquad (2)$$

(b) for input term neurons $n \in N_2$:

$$a_n = net_n = -\frac{(o_{n'} - m_{n',n})^2}{\sigma_{n',n}^2} \qquad (3)$$

with $n' \in N_1$ and $o_n = e^{a_n}$,

(c) for rule neurons $n \in N_3$:

$$a_n = net_n = \min_{n' \in N_2} \{ W_{2,3}(n', n) \cdot o_{n'} \} \qquad (4)$$

(d) for output term neurons $n \in N_4$:

$$a_n = \min(1, net_n), \quad net_n = \sum_{n' \in N_3} W_{3,4}(n', n) \cdot o_{n'} \qquad (5)$$

(e) for output neurons $n \in N_5$:

$$a_n = net_n = \sum_{n' \in N_4} W_{4,5}(n', n) \cdot o_{n'} \qquad (6)$$

with $W_{4,5}(n', n) = \dfrac{m_{n',n} \cdot \sigma_{n',n}}{\sum_{n'' \in N_4} \sigma_{n'',n} \cdot o_{n''}}$.
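
The flow of activations through the five layers can be sketched in a few lines of Python. The sketch rests on assumptions that should be read as such: the bell-shaped functions are taken to be of Gaussian form, all existing connections into rule and output term neurons carry the weight 1, the crisp output is a weighted average over the output terms in the style of Lin and Lee, and all parameter values are hypothetical.

```python
import math


def bell(x, m, sigma):
    """Bell-shaped membership with center m and width sigma (assumed Gaussian)."""
    return math.exp(-((x - m) ** 2) / sigma ** 2)


def forward(x, input_terms, rules, output_terms):
    """Propagate the crisp inputs x through the five layers.

    input_terms[i] : list of (m, sigma) pairs for input variable i
    rules          : list of (antecedent, consequent); antecedent holds one
                     term index per input variable, consequent one output term
    output_terms   : list of (m, sigma) pairs for the output variable
    """
    # Layers 1 and 2: pass the inputs on and evaluate all membership functions.
    member = [[bell(xi, m, s) for (m, s) in terms]
              for xi, terms in zip(x, input_terms)]
    # Layer 3: a rule fires with the minimum of its antecedent memberships.
    fire = [min(member[i][t] for i, t in enumerate(antecedent))
            for antecedent, _ in rules]
    # Layer 4: sum the firing strengths per output term, bounded by 1.
    out = [min(1.0, sum(f for f, (_, j) in zip(fire, rules) if j == term))
           for term in range(len(output_terms))]
    # Layer 5: weighted average of the output term centers (assumed form).
    num = sum(m * s * o for (m, s), o in zip(output_terms, out))
    den = sum(s * o for (_, s), o in zip(output_terms, out))
    return num / den if den > 0 else float("nan")


# Hypothetical knowledge base with two inputs, O3(t) and T(t+1).
INPUT_TERMS = [[(60.0, 30.0), (140.0, 30.0)],   # O3(t): low, high
               [(18.0, 6.0), (32.0, 6.0)]]      # T(t+1): low, high
RULES = [((0, 0), 0),    # IF O3 low  AND T low  THEN O3(t+1) low
         ((1, 1), 1)]    # IF O3 high AND T high THEN O3(t+1) high
OUTPUT_TERMS = [(70.0, 20.0), (170.0, 20.0)]

y_low = forward((60.0, 18.0), INPUT_TERMS, RULES, OUTPUT_TERMS)
y_high = forward((140.0, 32.0), INPUT_TERMS, RULES, OUTPUT_TERMS)
print(round(y_low, 2), round(y_high, 2))
```

Because each rule neuron takes the minimum over its antecedents, a rule contributes to the output only to the degree to which its weakest condition is satisfied.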

3.2 Two-step Hybrid Learning Algorithm

The learning algorithm allows the determination of a suitable network struc- 
ture as well as the calculation of the connection weights. 

The changes in the network structure produce simultaneously a modification 
of the membership functions of the linguistic terms and the linguistic rules. 

To start the learning process, we must specify the number of the fuzzy par- 
titions for each input and output variable. 

In the first phase the provisional membership functions of the terms are
determined for all linguistic variables $\xi \in \{\xi_1, \ldots, \xi_m\}$.

The learning rule of KOHONEN is used to find a mean value for each 
membership function with a cluster method. The variance of the bell-shaped 
function can be determined with the help of an N-nearest-neighbors 
heuristic. 
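
The first phase can be illustrated by the following sketch for a single linguistic variable. It is a simplified reading under explicit assumptions: a winner-take-all update stands in for the learning rule of Kohonen, the width is derived from the distance to the first nearest neighbouring center (that is, N = 1), and the learning rate, the overlap parameter and the sample values are hypothetical.

```python
import random


def kohonen_centers(samples, n_terms, rate=0.05, epochs=50, seed=0):
    """Find one center per term by winner-take-all competitive learning."""
    rng = random.Random(seed)
    lo, hi = min(samples), max(samples)
    # Spread the initial centers evenly over the observed range.
    centers = [lo + (hi - lo) * (i + 0.5) / n_terms for i in range(n_terms)]
    data = list(samples)
    for _ in range(epochs):
        rng.shuffle(data)
        for x in data:
            win = min(range(n_terms), key=lambda i: abs(x - centers[i]))
            centers[win] += rate * (x - centers[win])  # only the winner moves
    return sorted(centers)


def widths(centers, overlap=1.5):
    """Width of each bell: distance to the nearest other center / overlap."""
    return [min(abs(c - d) for j, d in enumerate(centers) if j != i) / overlap
            for i, c in enumerate(centers)]


# Hypothetical ozone values (µg/m³) that form two clusters.
SAMPLES = [48.0, 50.0, 52.0, 148.0, 150.0, 152.0] * 5

centers = kohonen_centers(SAMPLES, n_terms=2)
sigmas = widths(centers)
print([round(c, 1) for c in centers], [round(s, 1) for s in sigmas])
```

The overlap parameter controls how strongly neighbouring terms overlap; the resulting centers and widths are only provisional and are refined in the second phase.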

After calculation of the membership functions the consequents of the
linguistic rules are determined. For this purpose a competitive weight is
determined for each connection $W_{3,4}(n', n)$ with $n' \in N_3$ and
$n \in N_4$.

For the implementation the competitive learning rule of GROSSBERG is used:
$\dot{w}_{n',n}(t) = o_n \cdot (-w_{n',n} + o_{n'})$, where $o_n$ serves as
a win-loss index of the term $B_j$ (node $n$ at layer four).

The differential equations can be solved numerically by the EULER-CAUCHY
integration method.

Finally, for each rule neuron $n'$ the connection $W_{3,4}(n', n)$ to a term
neuron $n_{B_j}$ of the linguistic variable $\eta$ is chosen for which the
competitive weight $w_{n',n}$ has the greatest value. These connections
remain in $W_{3,4}$, whereas all other connections are deleted.
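
The selection of the consequents can be sketched as follows. The fragment integrates the competitive learning rule with explicit Euler steps and then keeps, for each rule neuron, the connection with the greatest weight. The step size and the training patterns are hypothetical, and the function names are our own.

```python
def train_competitive(samples, n_rules, n_terms, dt=0.1):
    """Integrate dw/dt = o_n * (-w + o_n') with explicit Euler steps.

    Each sample is a pair (rule_out, term_out): the outputs o_n' of the rule
    neurons and the outputs o_n of the output term neurons for one pattern.
    """
    w = [[0.0] * n_terms for _ in range(n_rules)]
    for rule_out, term_out in samples:
        for r in range(n_rules):
            for t in range(n_terms):
                w[r][t] += dt * term_out[t] * (rule_out[r] - w[r][t])
    return w


def select_consequents(w):
    """Keep for every rule neuron the output term with the greatest weight."""
    return [max(range(len(row)), key=row.__getitem__) for row in w]


# Hypothetical patterns: rule 0 fires when output term 0 is active,
# rule 1 fires when output term 1 is active.
SAMPLES = [([0.9, 0.1], [1.0, 0.0]),
           ([0.1, 0.8], [0.0, 1.0])] * 20

weights = train_competitive(SAMPLES, n_rules=2, n_terms=2)
consequents = select_consequents(weights)
print(consequents)
```

A weight only changes while its output term neuron is active, so each weight converges towards the typical firing strength of the rule in situations where that output term applies.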

At the end of the self-organized phase a set of linguistic rules can be
replaced by a single combined rule if three conditions are met:

(a) the rules have exactly the same consequences, 

(b) preconditions are common to all the rule nodes in this set, 

(c) the union of other preconditions of these rule nodes composes the 
whole term set of some input linguistic variables. 

In the second phase, a gradient method following the principle of
backpropagation is used to find the optimal parameters for the membership
functions of the terms of the linguistic variables.



4 Results 

4.1 Design and Implementation of an Intelligent Tool 

The tool maps the working steps a user takes to construct an optimal
knowledge base onto corresponding functions of the software product.
Schulze (1997) develops the components data administration, knowledge
integration and extraction, knowledge acquisition, and graphical evaluation
for a Fuzzy-Rule Generator.

4.2 Generation of an Optimal Knowledge Base



For the rule generation the measuring series of the years 1992-1994 are used 
as the learning set for the neuro-fuzzy system (NFS) and the measuring val- 
ues of the year 1995 as the test set. 

The correlation coefficients of the data show that there are strong
relationships between $O_3(t+1)$ and the measured values $O_3(t)$ and
$T(t+1)$.

The knowledge base generated with the NFS describes the functional
relationship between these parameters with 22 rules. The root mean square
error (RMSE) is [...]. The partitions of the output variable $\eta(t+1)$
are represented by the curves in figure 3.




[Plot; axis label: O3(t)]

Figure 3: The partitions of the linguistic output variable $O_3(t+1)$.

In the figure the abbreviations VVL, ..., VVH denote the corresponding
linguistic terms “very very low”, ..., “very very high”. The rules are
represented in the rule table (Table 1).



                              O3(t)
T(t+1)   VVL    VL     LOW    MED    HIGH   VH     VVH
VVL      -      -      -      -      -      -      -
VL       LOW    VL     VL     -      LOW    LOW    -
LOW      -      VL     VL     LOW    -      HIGH   -
MED      -      -      -      MED    -      MED    HIGH
HIGH     -      -      LOW    -      VL     -      -
VH       -      LOW    LOW    HIGH   VH     VH     VH
VVH      -      -      -      -      VH     VVH    -

Table 1: Rule table for the linguistic output variable $O_3(t+1)$. A dash
marks a combination of terms without a rule.

The linguistic rules reflect the relationship between temperature and
ozone concentration well, but for $O_3(t+1)$ a maximum over 180 µg/m³ is
predicted only if the following conditions are met:
140 µg/m³ < $O_3(t)$ < 160 µg/m³ and $T(t+1)$ > 35°C.

For this reason the four ozone concentrations over 180 µg/m³ of the year
1995 are not recognized by the system.

With the NFS 39 experiments were performed. The best results of the
prognosis for the maximum of the ozone concentration during the day
$(t+1)$, measured with the RMSE (17.9 µg/m³), are achieved with the input
variables $O_3(t)$, $T(t)$ and $\Delta T(t)$. Two of the four ozone
concentrations over 180 µg/m³ were recognized. The 45 generated rules show
a better prognosis behavior than the rules of experts (see 2.4).
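
The two evaluation criteria used above, the RMSE and the number of recognized exceedances, can be computed as in the following sketch. The observed and predicted values are invented for illustration and are not taken from the 1995 test set.

```python
import math


def rmse(predicted, observed):
    """Root mean square error between two series of equal length."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("series must be non-empty and of equal length")
    squared = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(squared / len(predicted))


def exceedances_recognized(predicted, observed, threshold=180.0):
    """Return (recognized, total) for days whose observed value exceeds
    the threshold; a day is recognized if the prediction exceeds it too."""
    total = sum(1 for o in observed if o > threshold)
    hits = sum(1 for p, o in zip(predicted, observed)
               if o > threshold and p > threshold)
    return hits, total


# Invented daily ozone maxima in µg/m³.
OBSERVED = [120.0, 185.0, 190.0, 150.0]
PREDICTED = [110.0, 181.0, 170.0, 160.0]

error = rmse(PREDICTED, OBSERVED)
recognized = exceedances_recognized(PREDICTED, OBSERVED)
print(round(error, 2), recognized)
```

The example shows why both criteria are reported: a moderate RMSE does not guarantee that the rare days above the threshold are predicted as such.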

In summary, the neuro-fuzzy system for ozone prognosis provides a
relatively good prediction only if the input patterns have few features and
the knowledge base has a small rule set relative to the size of the
learning problem. Further experiments concerning features and terms are
necessary.



Attributing Shares of Risk to Grouped
or Hierarchically Ordered Factors

Abstract: The attributable risk has been introduced as an epidemiological
parameter that quantifies one proportion of disease events in a population that can
be assigned to the adverse effects of certain risk factors. Recently, this concept has 
been generalised to the problem of simultaneously assessing n proportions of dis- 
ease events, called partial attributable risks, that can be ascribed to the individual 
effects of n factors. The paper describes hierarchical and grouped variants of these 
parameters that appropriately combine the properties of partial attributable risks 
with the additional possibility of grouping and/or hierarchically ordering multiple 
factors to allow for more realistic multifactorial frameworks. 



1 Introduction 

Epidemiological methods of risk attribution are concerned with the assess- 
ment of the proportion of disease events in a population that can be ascribed 
to risk factors under study. The classical risk parameter applicable to such 
problems was introduced by Levin (1953) in order to quantify the “propor- 
tion of lung cancer attributable to smoking”. The dynamic development 
of multifactorial statistical methods during the last decades had a remark- 
able impact on the definition of this so-called attributable risk parameter 
and its estimation. In particular, the derivation of adjusted attributable
risk parameters (AR) and their model-based estimation were the subject of
intensive research activities (Gefeller (1992), Greenland and Drescher (1993),
Coughlin et al. (1994), Basu and Landis (1995)). 

The AR quantifies one proportion of the risk of disease that can be at- 
tributed to being exposed to a certain set of risk factors. Apart from this, 
epidemiological research on the impact of risk factors on the disease load 
in the population frequently demands the simultaneous assessment of multi- 
ple proportions that can be attributed to the respective individual effects of 
the factors. A necessary condition for any risk attribution method to prove 
adequate in that situation is that it sufficiently incorporates the concept of 
interactions between the exposure factors. Recently, the partial attributable
risk (PAR) was developed as a new multidimensional parameter that meets
these requirements (Eide and Gefeller (1995), Land and Gefeller (1997)).
The interpretation of PAR, however, is based on the implicit assumption 
that all exposure factors can be treated as if they were of the same a priori 
interest for potential preventive measures. 

In this paper an epidemiological study on the association of obstructive
lung diseases with smoking and occupational exposure to quartz dust (Bakke
et al. (1991)), based on a random sample of 4469 persons in the population
of Hordaland, Norway, exemplifies that this assumption is frequently
violated. The third and fourth sections thus introduce new hierarchical
(HPAR) and grouped (GPAR) variants of the PAR parameter. These new
parameters combine (i) the possibility of grouping or hierarchically
ordering multiple exposure variables in order to allow for more flexible
and realistic multifactorial models that account for the different roles
exposures play in practice with (ii) the desirable properties of PAR. Above
all, it can be proved that they are unique in the sense that they are the
only parameters simultaneously having these properties. Throughout the
paper their characteristics are illustrated by data from the Hordaland
study.



2 The Partial Attributable Risk 

The PAR, as well as its hierarchical and grouped variants, uses the
following formalization of attributable risks in the multifactorial
situation. The explanatory variables are divided into a set of $n$
categorical exposure variables $E_1, \ldots, E_n$ of primary interest and an
additional variable $C$ with categories $1, \ldots, l$, specifying different
subpopulations. For $i \in \{1, \ldots, n\}$ let $E_i$ have the categories
$0, \ldots, k_i - 1$, where $E_i = 0$ denotes no exposure to the $i$-th risk
factor. $D$ is a binary random variable indicating whether a subject
developed the disease ($D = 1$) or not ($D = 0$). For a subset
$E \subseteq \{E_1, \ldots, E_n\}$, let all combinations of levels of
variables not in $E$ define the strata

$$S_{c,e} = \{C = c\} \cap \bigcap_{E_j \notin E} \{E_j = e_j\},$$

where $c \in \{1, \ldots, l\}$ are categories of $C$ and
$e = (e_j)_{E_j \notin E}$ denote vectors of categories of variables in
$\{E_1, \ldots, E_n\} \setminus E$. The combined adjusted attributable risk
of all variables included in $E$ is

$$AR(E) = \frac{P(D = 1) - \sum_{c,e} P(S_{c,e}) \cdot P\bigl(D = 1 \mid S_{c,e} \cap \bigcap_{E_i \in E} \{E_i = 0\}\bigr)}{P(D = 1)}.$$

$AR(E)$ quantifies the percentage reduction in the disease rate when in each
stratum the population is prevented from exposure to any risk factor that
is described by a variable included in $E$ (Coughlin, Benichou and Weed
(1994)).
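
For the simplest case of a single binary exposure and strata defined by $C$ alone, the adjusted attributable risk can be computed directly from counts. The following sketch is an illustration with invented numbers; the data layout and the function name are our own assumptions and not part of the Hordaland analysis.

```python
def adjusted_ar(counts):
    """Adjusted attributable risk of one exposure.

    counts maps (stratum c, exposure e) -> (persons, cases), where e == 0
    denotes the unexposed. Every stratum needs an unexposed group.
    """
    n_total = sum(n for n, _ in counts.values())
    p_disease = sum(d for _, d in counts.values()) / n_total
    expected_if_unexposed = 0.0
    for c in {stratum for stratum, _ in counts}:
        n_c = sum(n for (s, _), (n, _) in counts.items() if s == c)
        n0, d0 = counts[(c, 0)]
        # P(S_c) * P(D = 1 | S_c, unexposed)
        expected_if_unexposed += (n_c / n_total) * (d0 / n0)
    return (p_disease - expected_if_unexposed) / p_disease


# Invented counts for two strata of C and a binary exposure.
COUNTS = {
    (1, 0): (100, 10), (1, 1): (100, 30),
    (2, 0): (50, 10),  (2, 1): (150, 60),
}

ar = adjusted_ar(COUNTS)
print(round(ar, 4))
```

In this example 110 of 400 persons are diseased, whereas only 15% would be expected to be diseased if everybody had the stratum-specific risk of the unexposed, which gives an attributable risk of 5/11.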

In the Hordaland study, for instance, the presence of respiratory disorders 
was related to the following exposure variables of primary interest: 

• S (smoking, 5 categories): S = 0 if the subject has never smoked daily,

• O (occupational exposure, binary): O = 0 if the subject was never
exposed to dust or gas,

• A (age, binary): A = 0 if the subject is at most 16 years old,

• X (sex, binary): X = 0 for females.

Maximum likelihood estimates of the adjusted attributable risks for the
occurrence of breathlessness during exercise in the population of Hordaland
are summarized in Table 1. They were derived by logistic regression
analysis (Greenland and Drescher (1993)) as detailed by Eide and Gefeller
(1995). The quantitative information in the table does not allow for easily
assessing the contributions of single exposures. Most methods of
attributing the risk of disease to multiple risk factors are thus based on
the idea of assessing exposure-specific effects by successively
incorporating the variables into the analysis while simultaneously
calculating their respective additive contributions to the combined
attributable risks. These contributions are called sequential attributable
risks ($SAR^{\pi}$), where $\pi = (E_{i_1} \cdots E_{i_n})$ denotes a
permutation of the variables which specifies the order of incorporating
them into the analysis:

$SAR^{\pi}_{E_{i_k}} = AR(\{E_{i_1}, \ldots, E_{i_k}\}) - AR(\{E_{i_1}, \ldots, E_{i_{k-1}}\})$ and

$SAR^{\pi}_{E_{i_1}} = AR(\{E_{i_1}\})$ for the special case $k = 1$.

In the Hordaland study, for example, the sequential attributable risks were
estimated with respect to the permutation $\pi = (O\ A\ S\ X)$ as

$SAR^{\pi}_O = 0.1334$, $SAR^{\pi}_S = 0.2076$, $SAR^{\pi}_A = 0.4584$, $SAR^{\pi}_X = 0.0620$. (1)
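
The values in (1) can be retraced step by step from the combined attributable risks. The worked computation below assumes the Table 1 estimates $AR(\{O\}) = 0.1334$, $AR(\{O,A\}) = 0.5918$, $AR(\{S,O,A\}) = 0.7994$ and $AR(\{S,O,A,X\}) = 0.8614$.

```latex
\begin{align*}
SAR^{\pi}_{O} &= AR(\{O\}) = 0.1334,\\
SAR^{\pi}_{A} &= AR(\{O,A\}) - AR(\{O\}) = 0.5918 - 0.1334 = 0.4584,\\
SAR^{\pi}_{S} &= AR(\{S,O,A\}) - AR(\{O,A\}) = 0.7994 - 0.5918 = 0.2076,\\
SAR^{\pi}_{X} &= AR(\{S,O,A,X\}) - AR(\{S,O,A\}) = 0.8614 - 0.7994 = 0.0620.
\end{align*}
```

By construction the four contributions add up to $AR(\{S,O,A,X\}) = 0.8614$, whatever permutation is chosen; only the split between the factors depends on the order.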

The SARs of a single exposure depend on the ordering of the variables.
Consequently, they are not applicable for multifactorial risk attribution
in the case of exposure variables without any “naturally” given ordering.
In that case, the SARs are averaged over all possible sequences in order to
calculate the partial attributable risk (PAR) for the variable $E_i$:

$$PAR_{E_i} = \frac{1}{n!} \sum_{\pi} SAR^{\pi}_{E_i}$$



Table 1: Estimates of ARs of exposure to S, O, A, or X on the occurrence
of breathlessness during exercise in the Hordaland study

estimates of               adjustment   estimates of               adjustment
(combined) ARs             variables    (combined) ARs             variables

AR({S}) = 0.2598           O,A,X        AR({O,X}) = 0.3693         S,A
AR({O}) = 0.1334           S,A,X        AR({A,X}) = 0.6355         S,O
AR({A}) = 0.5164           S,O,X        AR({S,O,A}) = 0.7994       X
AR({X}) = 0.2498           S,O,A        AR({S,O,X}) = 0.5417       A
AR({S,O}) = 0.3527         A,X          AR({S,A,X}) = 0.8333       O
AR({S,A}) = 0.7666         O,X          AR({O,A,X}) = 0.7022       S
AR({S,X}) = 0.4601         O,A          AR({S,O,A,X}) = 0.8614     none
AR({O,A}) = 0.5918         S,X

With reference to the example above the PARs are estimated as

$PAR_S = 0.2096$, $PAR_O = 0.0795$, $PAR_A = 0.4177$, $PAR_X = 0.1547$. (2)
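
The averaging over all sequences can be retraced with a short program. The sketch below assumes the estimates of the combined attributable risks as read from Table 1; it averages the sequential attributable risks over all 4! = 24 orderings of the factors. The function names are our own.

```python
from itertools import permutations

# Combined adjusted attributable risks as read from Table 1; the empty set
# of eliminated exposures has AR = 0.
AR = {frozenset(key): value for key, value in {
    "": 0.0,
    "S": 0.2598, "O": 0.1334, "A": 0.5164, "X": 0.2498,
    "SO": 0.3527, "SA": 0.7666, "SX": 0.4601,
    "OA": 0.5918, "OX": 0.3693, "AX": 0.6355,
    "SOA": 0.7994, "SOX": 0.5417, "SAX": 0.8333, "OAX": 0.7022,
    "SOAX": 0.8614,
}.items()}


def sequential_ar(order):
    """Sequential attributable risks for one ordering of the factors."""
    eliminated, sar = frozenset(), {}
    for factor in order:
        sar[factor] = AR[eliminated | {factor}] - AR[eliminated]
        eliminated = eliminated | {factor}
    return sar


def partial_ar(factors="SOAX"):
    """Partial attributable risks: SARs averaged over all orderings."""
    orders = list(permutations(factors))
    return {f: sum(sequential_ar(order)[f] for order in orders) / len(orders)
            for f in factors}


sar = sequential_ar("OASX")
par = partial_ar()
print({f: round(v, 4) for f, v in sar.items()})
print({f: round(v, 4) for f, v in par.items()})
```

Up to rounding in the last digit, the averages agree with the estimates in (2), and their sum equals the combined attributable risk of all four factors.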

The PAR of an exposure variable can be interpreted as the expected pro- 
portion of cases of disease preventable by the additional elimination of this 
exposure after a random collection of exposures had already been eliminated 
in the population (Land and Gefeller (1997)). The PAR has desirable prop- 
erties, which are outlined in the following. 

1) Component additivity: $\sum_{i=1}^{n} PAR_{E_i} = AR(\{E_1, \ldots, E_n\})$,
which means that the sum of the PAR components quantifies the reduction in
the disease rate hypothetically achievable by completely eliminating all
exposures in the population.

2) Internal marginal rationality: If variables
$E_i, E_j \in \{E_1, \ldots, E_n\}$ satisfy

$AR(E \cup \{E_i\}) > AR(E \cup \{E_j\})$ for all $E \subseteq \{E_1, \ldots, E_n\} \setminus \{E_i, E_j\}$, (3)

then $PAR_{E_i} > PAR_{E_j}$. (4)



This property can be illustrated by the estimated attributable risks in the 
Hordaland study. Table 1, for instance, shows that 

M{{X}) > AR^}), M{{S, X}) > AR^S, O}), 

AR{{A, X}) > Ar({A, O}), AR{{S, A, X}) > AR{{S, A, O}). 

These inequalities indicate that sex differences contribute more to combined 
attributable risks than occupational exposure. From (2) and (1) it can be 
seen that the PAR reflects this predominance of X, which is a consequence 
of the PAR being internally marginally rational. Note that the SAR does 
not have this property, which can be seen by comparing its estimated com- 
ponents for X and O in (1). 

3) Marginal rationality: Suppose that $AR^{I}(\cdot)$ and $AR^{II}(\cdot)$ denote
attributable risks for populations I and II and that the same set of exposure
variables $E_1, \ldots, E_n$ is considered for both populations. If a variable $E_i$ and
all subsets $E \subseteq \{E_1, \ldots, E_n\} \setminus \{E_i\}$ satisfy

$AR^{I}(E \cup \{E_i\}) - AR^{I}(E) \ge AR^{II}(E \cup \{E_i\}) - AR^{II}(E)$, (5)

then the i-th risk factor contributes at least as much to combined
attributable risks in population I as in population II. The PAR reflects this
ranking when it is calculated separately for both populations:
$PAR^{I}_{E_i} \ge PAR^{II}_{E_i}$. While internal marginal rationality ensures a
reasonable ranking of exposures when they are compared with respect to their
impact on the disease load in the same population, the related property of
marginal rationality focuses on the comparison of the impact of the same
exposure with respect to different populations. Furthermore, no method apart
from the PAR can be defined that simultaneously satisfies all three properties.
Proofs are provided by Land and Gefeller (1997), and the practical implications
of these characteristics for epidemiological research are discussed in Gefeller
et al. (1998).



3 Hierarchical Variants of PAR 



The SARs and the PAR occupy different extremes in the concept of mul- 
tifactorial risk attribution: the PAR is suited to assess exposure-specific 
contributions to the combined AR of multiple a priori equally ranking expo- 
sures, whereas the SARs are reserved to the analysis of completely hierarchi- 
cally ordered variables. Both extremes, however, will rarely be appropriate 
to modeling the different roles that exposures play in practice. In the Horda- 
land study, for instance, the exposures are not of the same interest. In fact, 
smoking and occupational exposures are promising targets for preventive 
measures, whereas age and sex are not. In order to account for the qualita- 
tive differences between the variables in the Hordaland study, multifactorial 
risk attribution ought to be based on a new hierarchical variant of the partial 
attributable risk (HPAR) that (i) imposes a hierarchy among the variable
classes {S, O} and {A, X} and (ii) treats the variables within the classes as
a priori equally ranking.

In order to give a formal definition of HPAR let the set $\{E_1, \ldots, E_n\}$ of
exposure variables be entirely partitioned into m exposure classes
$\mathcal{H}_1, \ldots, \mathcal{H}_m$ with respective cardinalities $\mu_1, \ldots, \mu_m$.
Hierarchical risk attribution is based on hierarchically arranging these
classes. In practice, natural orderings of classes are induced by certain
epidemiological considerations. Formally, however, an ordering is realized by
introducing the concept of hierarchical permutations. The latter are
permutations of exposure variables satisfying the hierarchy principle: for
$k = 1, \ldots, m$, the first $\mu_1 + \cdots + \mu_k$ variables are included in
$\mathcal{H}_1 \cup \cdots \cup \mathcal{H}_k$. Finally, for a variable $E_i$ let

$HPAR_{E_i} = \frac{1}{\mu_1! \cdots \mu_m!} \sum_{\pi \text{ hierarchical}} SAR^{\pi}_{E_i}$.

Important mathematical properties of the HPAR are discussed in the following.

1) Hierarchical component additivity: For $k \in \{2, \ldots, m\}$ the sum of
components $\sum_{E_i \in \mathcal{H}_k} HPAR_{E_i}$ is equal to
$AR(\mathcal{H}_1 \cup \ldots \cup \mathcal{H}_k) - AR(\mathcal{H}_1 \cup \ldots \cup \mathcal{H}_{k-1})$.
For the special case $k = 1$ the sum $\sum_{E_i \in \mathcal{H}_1} HPAR_{E_i}$ is equal to $AR(\mathcal{H}_1)$.

2) Hierarchical variant of internal marginal rationality: For all
$E_i, E_j \in \mathcal{H}_k$, $k \in \{1, \ldots, m\}$, relation (3) implies that
$HPAR_{E_i} \ge HPAR_{E_j}$. While internal marginal rationality allows for
"fairly" comparing the PARs of different exposure variables to each other, the
hierarchical variant is restricted to comparing the HPARs of exposure
variables in the same class.

3) Marginal rationality: For all $E_i$ relation (5) implies $HPAR^{I}_{E_i} \ge HPAR^{II}_{E_i}$.
Above all, it is noteworthy that no method of epidemiological risk attribution
in the hierarchical context apart from the HPAR can simultaneously fulfill
marginal rationality and the hierarchical analogues of component additivity
and internal marginal rationality.

In the Hordaland study the classes are $\mathcal{H}_1 = \{S, O\}$ and $\mathcal{H}_2 = \{A, X\}$. The
HPARs are estimated as

$HPAR_S = 0.2396$, $HPAR_O = 0.1132$, (6)

$HPAR_A = 0.3832$, $HPAR_X = 0.1255$. (7)
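
The values in (6) and (7) can be retraced by averaging SARs over the hierarchical permutations only. The Python sketch below is an illustration under stated assumptions, not the authors' implementation: `AR` repeats the Table 1 estimates (with the AR of the empty set assumed to be 0), and the function names `ar`, `hierarchical_orders` and `hierarchical_pars` are hypothetical.

```python
from itertools import permutations, product

# Combined attributable risks from Table 1, keyed by the sorted exposure labels.
AR = {
    "": 0.0,
    "S": 0.2598, "O": 0.1334, "A": 0.5164, "X": 0.2498,
    "OS": 0.3527, "AS": 0.7666, "SX": 0.4601,
    "AO": 0.5918, "OX": 0.3693, "AX": 0.6355,
    "AOS": 0.7994, "OSX": 0.5417, "ASX": 0.8333, "AOX": 0.7022,
    "AOSX": 0.8614,
}


def ar(exposures):
    """Look up the combined AR of a collection of exposure labels."""
    return AR["".join(sorted(exposures))]


def hierarchical_orders(classes):
    """Yield every ordering that exhausts class 1 before class 2, and so on."""
    for blocks in product(*(permutations(c) for c in classes)):
        yield [e for block in blocks for e in block]


def hierarchical_pars(classes):
    """HPAR of each exposure: average of its SARs over all hierarchical orderings."""
    orders = list(hierarchical_orders(classes))   # mu_1! * ... * mu_m! orderings
    totals = {e: 0.0 for c in classes for e in c}
    for order in orders:
        seen = []
        for e in order:
            totals[e] += ar(seen + [e]) - ar(seen)
            seen.append(e)
    return {e: total / len(orders) for e, total in totals.items()}


hpar = hierarchical_pars(["SO", "AX"])   # classes {S, O} before {A, X}
print({e: round(v, 4) for e, v in hpar.items()})
```

In this sketch the two class sums equal AR({S,O}) and AR({S,O,A,X}) - AR({S,O}), which is the hierarchical component additivity stated in property 1.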

There are major differences between the interpretations of estimated PARs
and their hierarchical variants in equations (2), (6) and (7), respectively.
First of all, the estimates of $HPAR_S$ and $HPAR_O$ are entirely adjusted
for age and sex differences, whereas the corresponding estimates in (2) are
not. This adjustment leads to a more realistic assessment of proportions of
hypothetically preventable cases of disease, because it accounts for the fact
that preventive measures cannot affect sex and current age and that the
relative effects of prevention can be different in the four subpopulations
defined with respect to all joint age-sex categories. Furthermore, the
estimated parameters in equation (7) can be interpreted as the respective
"residual attributable risks" of age and sex after the hypothetical
elimination of smoking and occupational exposures in the population. Finally,
the sum of $HPAR_S$ and $HPAR_O$ is equal to the combined effect of eliminating
smoking and occupational exposures (adjusted for age and sex), whereas
the sum of parameters estimated in (7) quantifies the percentage of cases of
disease that can be attributed to age or sex differences after smoking and
exposure to quartz dust were eradicated in the population of Hordaland.
This is a consequence of hierarchical component additivity.



4 Grouped Variants of PAR 

The use of HPARs requires (i) dividing the variables into exposure classes
and (ii) imposing a hierarchy among the classes. In some epidemiological
studies, however, (i) is of particular interest, whereas (ii) is not.
Suppose, for example, a study focuses on contrasting the influence of several
occupational exposure variables with that of multiple life-style variables
(like smoking, alcohol consumption) in order to provide an epidemiological
basis for judging problems of liability or compensation claims. In that
situation, multifactorial risk attribution ought to allow for fairly
contrasting the summarized contribution of occupational exposures to the
disease burden with the summarized contribution of life-style factors. Note,
however, that summing up the PARs or the HPARs of exposure variables over the
respective classes does not lead to meaningful values that can reasonably be
compared to each other. Thus, attributing the risk of disease to multiple
factors which are divided into classes without hierarchies ought to make use
of a new grouped variant (GPAR) of the partial attributable risk.

In order to formally introduce this parameter, let the exposure variables be
arranged in m disjoint groups $\mathcal{G}_1, \ldots, \mathcal{G}_m$. It is the objective of
grouped risk attribution to preserve the desirable properties of HPARs while
their hierarchical structures are broken up by means of an averaging process
among all hierarchical orderings of groups. The GPAR of a specific exposure
$E_i$ is thus defined as

$GPAR_{E_i} = \frac{1}{m!} \sum_{\sigma} HPAR^{\sigma}_{E_i}$,

where the sum runs over all permutations $\sigma$ of $\{\mathcal{G}_1, \ldots, \mathcal{G}_m\}$ and
$HPAR^{\sigma}_{E_i}$ denotes the $HPAR_{E_i}$ with respect to the hierarchical
arrangement $\mathcal{H}_1 = \sigma(\mathcal{G}_1), \ldots, \mathcal{H}_m = \sigma(\mathcal{G}_m)$.
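
The averaging over group orderings can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the Hordaland estimates of Table 1 are reused with the grouping {S, O} and {A, X} purely as an example (no GPAR estimates are reported for that study), the AR of the empty set is assumed to be 0, and all function names are hypothetical.

```python
from itertools import permutations, product

# Combined attributable risks from Table 1, keyed by the sorted exposure labels.
AR = {
    "": 0.0,
    "S": 0.2598, "O": 0.1334, "A": 0.5164, "X": 0.2498,
    "OS": 0.3527, "AS": 0.7666, "SX": 0.4601,
    "AO": 0.5918, "OX": 0.3693, "AX": 0.6355,
    "AOS": 0.7994, "OSX": 0.5417, "ASX": 0.8333, "AOX": 0.7022,
    "AOSX": 0.8614,
}


def ar(exposures):
    """Look up the combined AR of a collection of exposure labels."""
    return AR["".join(sorted(exposures))]


def hierarchical_pars(classes):
    """HPAR: average SAR over all orderings that respect the class hierarchy."""
    orders = [[e for block in blocks for e in block]
              for blocks in product(*(permutations(c) for c in classes))]
    totals = {e: 0.0 for c in classes for e in c}
    for order in orders:
        seen = []
        for e in order:
            totals[e] += ar(seen + [e]) - ar(seen)
            seen.append(e)
    return {e: total / len(orders) for e, total in totals.items()}


def grouped_pars(groups):
    """GPAR: average of the HPARs over all m! hierarchical orderings of the groups."""
    orderings = list(permutations(groups))
    totals = {e: 0.0 for g in groups for e in g}
    for ordering in orderings:
        for e, value in hierarchical_pars(ordering).items():
            totals[e] += value
    return {e: total / len(orderings) for e, total in totals.items()}


gpar = grouped_pars(["SO", "AX"])   # illustrative grouping, no hierarchy imposed
print({e: round(v, 4) for e, v in gpar.items()})
```

With only two groups, the sum of the GPARs within a group reduces to the average of the group's AR when it is eliminated first and its incremental AR when it is eliminated last.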

The vector of GPARs quantifies the average contributions of individual risk
factors to combined ARs in a grouped multifactorial framework. It has the
following mathematical properties:

1) Grouped component additivity: For $k = 1, \ldots, m$

$\sum_{E_i \in \mathcal{G}_k} GPAR_{E_i} = PAR_{F_k}$,

where the categorical exposure variable $F_k$ has one category for each possible
combination of categories of the variables included in $\mathcal{G}_k$ and where
$F_k = 0$ if and only if $E_i = 0$ for all $E_i \in \mathcal{G}_k$. Note that
$AR(F_1, \ldots, F_m)$ is equal to $AR(E_1, \ldots, E_n)$ and that $PAR_{F_k}$ is the
partial attributable risk of the k-th group, achieved by partitioning
$AR(F_1, \ldots, F_m)$ among the variables $F_1, \ldots, F_m$. The group-specific sums of
the GPARs are thus equal to the partial attributable risks of being exposed
to any risk factor included in the respective groups.

2) Grouped variant of internal marginal rationality: For all
$E_i, E_j \in \mathcal{G}_k$, $k \in \{1, \ldots, m\}$, relation (3) implies
$GPAR_{E_i} \ge GPAR_{E_j}$. Furthermore, if groups
$\mathcal{G}_k, \mathcal{G}_l \in \{\mathcal{G}_1, \ldots, \mathcal{G}_m\}$ satisfy

$AR(E \cup \mathcal{G}_k) \ge AR(E \cup \mathcal{G}_l)$ for all $E \subseteq \{E_1, \ldots, E_n\} \setminus \{\mathcal{G}_k, \mathcal{G}_l\}$,

then the sum $\sum_{E_i \in \mathcal{G}_k} GPAR_{E_i}$ is at least as high as
$\sum_{E_j \in \mathcal{G}_l} GPAR_{E_j}$. Therefore the GPAR allows for both a fair
comparison of exposure variables within groups and a fair comparison of
entire groups to each other.

3) Marginal rationality: For all $E_i$ relation (5) implies $GPAR^{I}_{E_i} \ge GPAR^{II}_{E_i}$.
Finally, it can be proved that no multifactorial risk attribution method apart
from the grouped variant of the partial attributable risk can simultaneously
fulfill marginal rationality and the grouped analogues of component
additivity and internal marginal rationality.



5 Conclusions

In multifactorial frameworks, the concept of epidemiological risk attribution
needs special methodological care to take combined effects of risk factors
into account. The concept of partitioning attributable risks using the
PAR parameter meets these requirements. The HPAR and the GPAR
appropriately combine the properties of the PAR with the additional
possibility of arranging multiple variables more flexibly in realistic frameworks.
Above all, it can be proved that no multifactorial risk attribution method
apart from the hierarchical (grouped) variants of the partial attributable risk
can simultaneously fulfill marginal rationality and the hierarchical (grouped)
analogues of component additivity and internal marginal rationality. The
HPAR and the GPAR are thus uniquely determined by these properties.



Abstract: Testing for the bioequivalence alternative $H_1: |\theta| < \Delta$ in a
normal univariate setting is usually performed with the well-known
"two one-sided t-tests procedure", which is an intersection-union test. Based
on the intersection-union principle, tests for the multivariate rectangular
alternative $H_1: \max_i |\theta_i| < \Delta$ have recently been constructed. In this
paper we show that rectangular hypotheses are not always a suitable
generalization of $H_1$ to the multivariate setting. In order to overcome the
drawbacks encountered with rectangular hypotheses we suggest ellipsoids as
alternatives. Finally, an asymptotic test for the new hypotheses is suggested
and compared with existing methods.



1 Introduction 

In the standard bioequivalence problem two different formulations of a drug
are to be shown to be similar with respect to their availability in the blood
circulation. Typically, several pharmacokinetic parameters are of interest.
For example, one might consider the area under the drug concentration
vs. time curve (AUC), the maximal concentration (Cmax) and the time
to achieve the maximum concentration (Tmax). These measures are by no
means exhaustive, e.g. Steinijans et al. (1995) suggest the consideration of
the shapes instead of the absorption rates of the concentration time curve,
which leads to a slightly different definition of bioequivalence. Here the mean
absorption time (MAT) or the mean residence time (MRT) are customary
quantities. Other approaches base the analysis on transformed measures, such as
Cmax/AUC, which was shown by Endrenyi et al. (1991) to be a characteristic
which is independent of intra-subject variations. However, all these
quantities are obtained from the same subject and hence the statistical analysis
should be performed in a multivariate setup. Indeed, drug authorities, such
as the FDA, currently require that in a single dose bioequivalence study of an
oral drug formulation bioequivalence is shown with respect to both AUC
and Cmax (FDA (1992), EC-GCP (1993)). The corresponding test declares
bioequivalence if every univariate comparison of the pharmacokinetic
parameters AUC and Cmax allows one to declare bioequivalence at the level of
significance $\alpha = 0.05$. Obviously, such an approach leads to conservative
procedures, and therefore the assessment of multivariate bioequivalence has
received great interest during the last five years (cf. Schall et al. (1996), Hsu
et al. (1995) or Berger and Hsu (1996)). Following current FDA practice,
all these authors consider the multivariate bioequivalence problem as a test
problem for a hyperrectangular alternative

$H_1: \max_i |\theta_i| < \Delta$ (1)

where $\theta_i$ denotes a measure of discrepancy of one of the characteristics
mentioned above. In this note, we argue that the rectangular alternative (1) is
not appropriate for decision making in multivariate bioequivalence. In
Section 2.2 we suggest ellipsoidal hypotheses instead. In Section 3 several tests
for the new hypotheses are considered. A new asymptotic test is suggested
and various finite sample modifications are discussed. In Section 4 we
illustrate the features of these hypotheses and tests in an example previously
discussed by Chinchilli and Elswick (1997).



2 New Hypotheses 

2.1 What is Wrong with Rectangular Hypotheses? 

In order to understand our main criticism of rectangular hypotheses consider
the comparison of AUC and Cmax in the most common setting of a 2 x 2
period crossover design (Chow and Liu (1992)) in a multiplicative model.
Due to federal regulation (FDA (1992)) it has to be shown that the ratios of the
expectations of the test (T) and reference (R) formulations stay within the
limits $\Delta_1 = 0.8$ and $\Delta_2 = 1.25$. After taking logarithms the corresponding
rectangular alternative is (cf. Brown et al. (1998))

$H_1: |\mu^T_{AUC} - \mu^R_{AUC}| < \ln(1.25),\ |\mu^T_{Cmax} - \mu^R_{Cmax}| < \ln(1.25)$.

Observe that in this alternative the "extremes"

$(\mu^T_{AUC} - \mu^R_{AUC},\ \mu^T_{Cmax} - \mu^R_{Cmax}) = \ldots$ (2)

are considered as equivalent. The same holds for the case of "marginal
perfect equivalence"

$\mu^T_{AUC} - \mu^R_{AUC} = 0,\ \mu^T_{Cmax} - \mu^R_{Cmax} = \ldots$ (3)

and vice versa. This implies that concentration-time curves are considered
as bioequivalent (and hence as therapeutically equivalent) for the "extremal"
case (2) without allowing for exceeding the value $\Delta = \ln(1.25)$ in one
coordinate, when perfect equivalence (i.e. the parameters coincide as in (3))
in the other coordinate is present. Certainly this contradicts our notion of
"equivalence" as a measure for the distance from "perfect equivalence" in
each direction. Hence a measure which comprises the distance from perfect
equivalence as a weighted sum of distances from each direction seems to be
more appropriate. Figure 1 gives a graphical illustration of the problem and
already points to the solution in the following section.

2.2 Ellipsoidal Hypotheses 

As seen in the last section, a measure of discrepancy which appropriately
combines both aspects Cmax and AUC seems to be more adequate. Of
course, the Euclidean norm is a natural and simple choice for such a measure
of equivalence. In order to allow each axis to be rescaled appropriately, we
consider, more generally, quadratic forms $Q_A(\theta) = \theta^t A \theta$
associated with positive definite matrices $A > 0$. Hence the new testing
problem is

$H_0: \theta^t A \theta \ge \Delta$ versus $H_1: \theta^t A \theta < \Delta$. (4)

Observe that the alternatives $H_1: \theta^t A \theta < \Delta$ are ellipsoids $\mathcal{E}_{A,\Delta}$.
According to the principal axes theorem, the principal axes of $\mathcal{E}_{A,\Delta}$
coincide with the maximal treatment differences of the respective
pharmacokinetic parameters the experimenter is willing to tolerate in this
direction. The simplest case, where A equals the identity $I_d$, corresponds to a
d-dimensional sphere $S^d$. When some characteristics are considered as less
important or are known to have more variability, a weighted Euclidean norm
becomes more adequate, i.e. A should be chosen as diagonal,
$A = \mathrm{diag}(a_1, \ldots, a_d)$, where each entry $a_j$ represents the weight to be
chosen for the coordinate $\theta_j$, $j = 1, \ldots, d$. Transforming the data $X_i$ into
$\Delta^{-1/2} A^{1/2} X_i$ (recall that A allows a decomposition $A^{1/2} A^{1/2} = A$)
we have reduced the testing problem to the standardized form

$H_0: \theta^t \theta \ge 1$ versus $H_1: \theta^t \theta < 1$. (5)

Hence constructing and assessing the performance of the tests for (4) can
always be reduced to this case.
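
The difference between the rectangular alternative and the ellipsoidal alternative (4) can be made concrete with a small Python sketch. It is an illustration only: the choice $A = I_2$ with $\Delta = (\ln 1.25)^2$ is an assumption (it makes the ellipsoid a circle of radius ln 1.25), and the function names and the two test points are hypothetical.

```python
import math

BOUND = math.log(1.25)  # per-coordinate limit of the rectangular alternative


def in_rectangle(theta, bound=BOUND):
    """Rectangular alternative: every coordinate stays below the bound."""
    return max(abs(t) for t in theta) < bound


def standardize(theta, weights, delta):
    """Map theta to Delta^(-1/2) A^(1/2) theta for diagonal A = diag(weights)."""
    return [math.sqrt(a / delta) * t for a, t in zip(weights, theta)]


def in_ellipsoid(theta, weights, delta):
    """Ellipsoidal alternative (4), checked in the standardized form (5)."""
    return sum(z * z for z in standardize(theta, weights, delta)) < 1.0


corner = (0.99 * BOUND, 0.99 * BOUND)   # both coordinates almost at the limit
on_axis = (0.99 * BOUND, 0.0)           # second coordinate perfectly equivalent

for theta in (corner, on_axis):
    print(in_rectangle(theta), in_ellipsoid(theta, (1.0, 1.0), BOUND ** 2))
```

Both points lie in the rectangle, but only the point with one perfectly equivalent coordinate lies in the ellipsoid, which is the behaviour argued for in Section 2.1.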




3 Bioequivalence Tests for Ellipsoidal 
Hypotheses 

In the following we propose two different methods to test the hypotheses in
(5). The first method gives exact $\alpha$-level tests, although rather
conservative ones, whereas the second approach is asymptotic and leads to more
liberal and more powerful procedures.

3.1 Confidence Inclusion Rules 

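Let

$X_1, \ldots, X_n \sim N_d(\theta, \Sigma)$ i.i.d. (6)

according to a d-variate normal distribution with expectation $\theta$ and
covariance $\Sigma$ unknown. Note that for a $1 - \alpha$ confidence set
$C_{1-\alpha}(X_1, \ldots, X_n)$ for $\theta$

$P_\theta(C_{1-\alpha}(X_1, \ldots, X_n) \subset \Gamma) \le \alpha \quad \forall\, \theta \notin \Gamma$

for any measurable set $\Gamma \subset \mathbb{R}^d$. Choosing $\Gamma$ as the alternative
$\Gamma = \{\theta : \theta^t \theta < 1\}$ this establishes

$\phi(X_1, \ldots, X_n) = 1$ if $C_{1-\alpha}(X_1, \ldots, X_n) \subset \Gamma$, and $0$ otherwise,

as $\alpha$-level test for the test problem

$H_0: \theta \notin \Gamma$ versus $H_1: \theta \in \Gamma$.

Munk and Pflüger (1998) even showed that one can increase power by selecting
a $1 - 2\alpha$ confidence region without increasing the level of the resulting
test. This somewhat surprising result depends heavily on the convexity of $\Gamma$
and a suitable equivariance property of the confidence region $C_{1-\alpha}$. In
particular it is shown that the confidence set associated with Hotelling's
$T^2$-test (Anderson (1984)) fulfills these properties. Its rejection region is
given as

$\mathcal{K}(\Gamma) = \{(\bar{x}, \hat{\Sigma}) : C^{H}_{1-2\alpha} \subset \Gamma\}$.

Here the confidence region is

$C^{H}_{1-2\alpha} := \{\theta_0 : T_0^2 \le f_{1-2\alpha}(n)\}$

where

$T_0^2 := n (\bar{x} - \theta_0)^t \hat{\Sigma}^{-1} (\bar{x} - \theta_0)$,

with $\bar{X} := \frac{1}{n} \sum_{i=1}^{n} X_i$, $\hat{\Sigma} := \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})^t$ and
$f_{1-2\alpha}(n) := \frac{d(n-1)}{n-d} F_{d,n-d;1-2\alpha}$, the correspondingly scaled
$(1-2\alpha)$-quantile of $F_{d,n-d}$. Here $F_{\nu_1,\nu_2}$ denotes a central
F-distribution with $\nu_1$ and $\nu_2$ degrees of freedom.



3.2 Testing with Quadratic Forms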



The following simple approach is based on the direct estimation of $\theta^t \theta$ by
$\bar{X}^t \bar{X}$. Certainly, it is reasonable to reject $H_0: \theta^t \theta \ge 1$ for small
values of $\bar{X}^t \bar{X}$. Hence it remains to determine the critical value $k_\alpha$
such that

$P_\theta(\bar{X}^t \bar{X} < k_\alpha) \le \alpha$ for $\theta^t \theta = 1$.

Unfortunately, the finite sample distribution of $\bar{X}^t \bar{X}$ follows a complicated
$\chi^2$-type and hence we suggest the following asymptotic procedure. An
application of the multivariate central limit theorem together with the
$\delta$-method and Slutsky's theorem (cf. Serfling (1980)) leads to

$\frac{\sqrt{n}\,(\bar{X}^t \bar{X} - \theta^t \theta)}{\sqrt{4 \bar{X}^t \hat{\Sigma} \bar{X}}} \xrightarrow{D} N(0, 1)$

so that the critical point can be taken as

$k_\alpha = \frac{2 \Phi^{-1}(\alpha) \sqrt{\bar{X}^t \hat{\Sigma} \bar{X}}}{\sqrt{n}} + 1$

with $\Phi^{-1}(\alpha)$ being the $\alpha$-quantile of the standard normal distribution.
Note that this test is robust with respect to the normality assumption of the
data $(X_i)_{i=1,\ldots,n}$ as long as

$\bar{X}^t \hat{\Sigma} \bar{X} \xrightarrow{P} \theta^t \Sigma \theta$. (7)

We mention that by the weak law of large numbers condition (7) holds
under rather weak assumptions, such as $E\|X_i\|^2 < \infty$.
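
As a hedged illustration of this asymptotic procedure, the following Python sketch evaluates the statistic, the critical value $k_\alpha$ and an asymptotic P-value. The numerical inputs are the summary statistics reported for the Ibuprofen data in Section 4, with n = 26 taken from the number of subjects listed there; the function names are hypothetical and the code is not the authors' implementation.

```python
import math
from statistics import NormalDist


def quadratic_form(x, m):
    """Return x^t M x for a vector x and a square matrix M (nested lists)."""
    size = len(x)
    return sum(x[i] * m[i][j] * x[j] for i in range(size) for j in range(size))


def asymptotic_test(mean, cov, n, alpha=0.05):
    """Asymptotic test of H0: theta^t theta >= 1 based on mean^t mean."""
    stat = sum(v * v for v in mean)
    # Estimated standard deviation of the statistic: 2 sqrt(x^t S x) / sqrt(n).
    scale = 2.0 * math.sqrt(quadratic_form(mean, cov)) / math.sqrt(n)
    k_alpha = NormalDist().inv_cdf(alpha) * scale + 1.0
    z = (stat - 1.0) / scale
    # Phi(z), written with erfc so that very small tail values do not round to 0.
    p_value = 0.5 * math.erfc(-z / math.sqrt(2.0))
    return stat, k_alpha, p_value, stat < k_alpha


mean = [-0.121332, 0.190746]
cov = [[0.1300518, 0.1368549],
       [0.1368549, 0.8743755]]
stat, k_alpha, p_value, reject = asymptotic_test(mean, cov, n=26)
print(stat, k_alpha, p_value, reject)
```

In this sketch the statistic lies far below the critical value, so $H_0$ is rejected, in line with the bound reported for the asymptotic test in Table 2.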

In order to improve the accuracy of the approximation to the (skewed)
distribution of $\bar{X}^t \bar{X}$ we investigated the following two distributions:

• a $g\chi^2_f$-distribution to fit the first 2 moments and

• a $(g\chi^2_f + b)$-distribution to fit the first 3 moments.

Here $\chi^2_\nu(\beta)$ denotes a noncentral $\chi^2$-distribution with noncentrality
parameter $\beta$ and $\nu$ degrees of freedom. This gives

$g = \frac{\frac{1}{n^2} \mathrm{trace}(\Sigma^2) + \frac{2}{n} \theta^t \Sigma \theta}{\theta^t \theta + \frac{1}{n} \mathrm{trace}(\Sigma)}$, $f = \frac{(\theta^t \theta + \frac{1}{n} \mathrm{trace}(\Sigma))^2}{\frac{1}{n^2} \mathrm{trace}(\Sigma^2) + \frac{2}{n} \theta^t \Sigma \theta}$

in the first case and

$g = \frac{c_3}{c_2}$, $f = \frac{c_2^3}{c_3^2}$ and $b = c_1 - \frac{c_2^2}{c_3}$

with $c_s := \sum_{j=1}^{p} \lambda_j^s (1 + s b_j^2)$, $s = 1, 2, \ldots$, where
$\bar{X}^t \bar{X} = \sum_{j=1}^{p} \lambda_j (U_j + b_j)^2$ and $U_1, \ldots, U_p \sim N(0, 1)$ i.i.d., in the
second case (cf. Mathai and Provost (1992)). The estimation of the parameters
g, f and b involves the estimation of rather simple quantities such as
$\mathrm{trace}(\Sigma)$. Hence computationally these approximations remain tractable.
For further details see Pflüger (1997).



4 A Data Example



The equivalence of availability of Ibuprofen in the blood for two drugs is to
be investigated for data presented by Chinchilli and Elswick (1997). Following
these authors it is to be tested whether the ratios of the AUCs (area under
the curve) and of the Cmax values (maximal concentration) are between 0.8 and
1.25.

In detail, it is assumed that n independent bivariate observations with
marginals

$AUC_i^T \sim \log N(\eta_A^T, \sigma_A^2)$, $AUC_i^R \sim \log N(\eta_A^R, \sigma_A^2)$, $i = 1, \ldots, n$,

are observed in a 2 x 2 crossover design without period effects. It is to be
assessed whether

$H_1: 0.8 < \frac{E[AUC^T]}{E[AUC^R]} < 1.25$.

(An analogous assumption and equivalence restriction hold for Cmax.) Letting
$E[AUC^T] = \exp(\eta_A^T + \frac{1}{2} \sigma_A^2)$ (Berger and Hsu (1996)) this reduces to

$H_1: |\eta_A^T - \eta_A^R| < \log 1.25$ and $|\eta_C^T - \eta_C^R| < \log 1.25$ (8)

where $\eta_C^T$ and $\eta_C^R$ denote the corresponding expectations of log Cmax. Taking

$X_{1i} := \frac{\log AUC_i^T - \log AUC_i^R}{\log 1.25}$ and $X_{2i} := \frac{\log Cmax_i^T - \log Cmax_i^R}{\log 1.25}$

yields the model (6) assumed in 3.1 with d = 2. Furthermore the testing
problem (8) is reformulated into

$H_0: \theta^t \theta \ge 1$ versus $H_1: \theta^t \theta < 1$. (9)



Subj.   AUC_R    AUC_T    Cmax_R   Cmax_T     Subj.   AUC_R    AUC_T    Cmax_R   Cmax_T

  1     139.12   140.83   29.81    26.60       14     100.01    96.47   31.64    25.88
  2     111.07   105.61   33.18    29.36       15      80.15    83.81   21.83    29.80
  3      86.95    87.35   29.75    31.61       16     104.83   102.16   25.38    24.40
  4      91.15   102.49   16.08    18.43       17     105.20   108.24   21.07    26.29
  5      76.60    85.22   28.05    24.17       18     105.86    95.45   45.94    24.95
  6      63.09    61.64   19.53    17.58       19     106.30    96.73   29.98    35.59
  7      84.96    92.18   34.29    28.15       20     106.60   105.51   35.29    24.79
  8     104.64   127.22   30.99    28.55       21      74.52    91.77   22.32    26.27
  9     143.23   156.90   38.50    38.19       22      96.59    99.24   24.83    26.06
 10      91.15    90.88   31.04    23.64       23      97.86    99.11   22.37    24.79
 11     105.36   107.00   31.49    37.61       24      97.15    89.41   28.51    21.61
 12      75.26    84.67   18.69    22.78       25     108.13   110.68   38.70    32.67
 13     139.92   152.62   39.17    40.49       26     128.58   123.56   30.82    29.91

Table 1: Listing of the raw data

We obtain the following summary statistics:

Mean difference ($\bar{X}$)        Covariance matrix ($\hat{\Sigma}$)

( -0.121332 )                      ( 0.1300518   0.1368549 )
(  0.190746 )                      ( 0.1368549   0.8743755 )



In the following table we display the resulting P-values for the hypothesis
(9) for Hotelling's inclusion test (Munk and Pflüger (1998)) in 3.1, for the
Brown, Casella and Hwang test (Brown et al. (1995)) and for the asymptotic
tests discussed in 3.2.

Bioequivalence test               P-value

Confidence Inclusion Rules
  Hotelling's Set                 = 0.00050
  Brown's Limaçon                 = 0.00021

Quadratic Forms
  Cramér (asympt.)                < 0.0000006
  2 moments fit                   < 0.0000006
  3 moments fit                   < 0.0000006

Table 2: Bounds for P-values of five multivariate bioequivalence tests for
the Ibuprofen data.

One finds that all these tests lead to the assessment of bioequivalence with
high evidence. Observe, however, that the "quadratic forms" tests give the
smallest P-values, which leads to the conjecture that these are more
powerful procedures than the confidence inclusion rules investigated.



5 Summary 

It is shown that ellipsoidal hypotheses improve on rectangular hypotheses
for the multivariate assessment of bioequivalence. Exact level-$\alpha$ tests which
are based on confidence inclusion rules turn out to be rather conservative,
whereas asymptotic procedures based on the pivot $\bar{X}^t \bar{X}$ are liberal but much
more efficient. Hence, it remains for the future to develop exact level-$\alpha$
tests which have reasonable power.

Acknowledgement: R. Pflüger was partially supported by the Deutsche
Forschungsgemeinschaft (Grant GE 637/3-1).



Abstract: We present enhancements to learning classifier systems which make
them capable of generating rules from case-based allergy test data. Iterated
classification and the integration of experts' suggestions mark important steps
on the way to a system assisting allergologists in determining substances which
are likely to be allergens for a patient whose history is provided. After
describing the requirements of this special classification problem we introduce
the interactive learning classifier system as a powerful tool for these tasks
and show some first practical results.



1 Introduction 

Today's allergology requires the management of large amounts of information
on patients. These typically include a patient's personal data, his profession
and allergological history as well as the symptoms observed during
examination and the substances he has been exposed to. Programs supporting
the collection and facilitating the processing of these data are already in
wide use and their popularity is still growing.

The "Informationsverbund dermatologischer Kliniken" (IVDK, information
network of departments of dermatology) collects these data from several
hospitals to help investigate possibilities to recognize upcoming new
allergens as early as possible. Today the IVDK maintains databases of more
than 50,000 cases. Each of these cases contains a patient's history and the
test results for all substances he was exposed to. It has become obvious that
the growing databases require advanced techniques for interpretation, which
led to the decision to design a case-based rule generation tool for analysing
them. The goal was to develop an algorithm that is able to extract a set of
rules from the database which still closely describes its general aspects.
Testing the constantly growing databases regularly will show developments
over time.

The rule sets generated may be used to discover upcoming new allergens.
But they can also be used to help an allergologist in determining which
substances to expose a patient to. This might help to make sure that no
substances that should be tested are overlooked. In addition this might help
to find connections between patients' data and allergens which will not be
found by intuition.



2 Rule-Based Systems

We decided to use a rule-based system because such systems store their
knowledge explicitly. Rules are easy for humans to understand, and therefore
experts can check the ones generated and maybe even learn from them. The
system we present even makes it possible to determine all cases that led to
establishing a rule. Finally, rules can be interpreted by computers quite fast.




The rule-based system consists of two parts: one component for the generation
of rules and one for interpreting them. While the rule generation, which
examines the data to produce a set of rules describing it as closely as
possible, will be discussed in the next section, we will now focus on the
desired interpretation of these rules.

The goal is to expose a patient only to substances which have at least a
certain probability of causing an allergic reaction. The usual testing
procedure today is to assemble test plans from a number of so-called test
schemes, which combine substances frequently encountered as allergens in
groups with certain characteristics in common. There is, for example, a barber
scheme, which contains the substances barbers typically react allergically to.
However, there may be substances in this scheme which are unlikely to
be allergens for a particular barber for reasons other than his profession. The
main idea of the rule interpretation is to be able to turn away from these
test schemes and instead to determine every single substance to expose a
patient to based on the rules generated. The situation the rule interpreter
will be used in is the anamnesis of a patient. The allergologist records the
important data of the patient's history, which are used as the input of the
rule interpreter. Based on this information an individual test plan for the
patient is generated.

Every single rule consists of attributes and an action part. The attributes in
this application are a patient's age, gender, current and former profession,
the location of the allergic reaction and already known allergic diseases.
Actions are always suggestions of certain substances. Thus, a typical rule
might look like this:

    if (gender == female) and (location == ear) suggest nickelsulfate

For each rule statistical information on its significance is available. This
information is provided by the rule interpreter to help the allergologist in
understanding why certain substances are suggested. If after this he still
believes a suggestion is wrong, he can overrule the suggestion or remove the
rule.



3 Learning Classifier Systems 




Figure 2: A learning classifier system 

The sets of rules to be interpreted are generated by a learning classifier 
system (LCS, see e.g. Nissen (1997), 265 ff). This machine learning technique 
produces rules from data presented to it in its so-called environment. The 
reaction to the current situation of the environment leads to an inductive 
learning process, in which regularities are generalized from single events. 
The knowledge already acquired is kept in a set of rules, similar to the 
one described above. Internally the rules are records of a vector specifying 
certain attributes in a machine readable way and a number representing the 
suggested substance. In principle they look like this: 





          age   gender   curr.   form.   location   known      known    substance
                         prof.   prof.              rhinitis   asthma   to suggest

value     25    female   nurse   none    ear        no         no       nickelsulfate
coding    3     1        314     0       17         0          0        1285

Possible values for the attributes include wildcards to indicate that one or
more attributes do not matter for a suggestion. Rules with wildcards match
any value for these particular attributes.
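
A minimal Python sketch of such coded rules with wildcards might look as follows. It is an assumption-laden illustration, not the system's actual data structure: the class `Rule`, the function `suggest` and the use of `None` as wildcard marker are hypothetical, and the third rule (gender code 0, substance code 42) is made up for contrast; only the codes 3, 1, 314, 0, 17, 0, 0 and 1285 are taken from the table above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

WILDCARD = None  # hypothetical marker: this attribute does not matter


@dataclass(frozen=True)
class Rule:
    # Codes in the order: age, gender, current profession, former profession,
    # location, known rhinitis, known asthma.
    condition: Tuple[Optional[int], ...]
    substance: int  # code of the substance to suggest

    def matches(self, case: Sequence[int]) -> bool:
        """A rule matches if every non-wildcard code equals the case's code."""
        return all(c is WILDCARD or c == v for c, v in zip(self.condition, case))


def suggest(rules, case):
    """Collect the substances suggested by all rules matching the case."""
    return sorted({r.substance for r in rules if r.matches(case)})


case = (3, 1, 314, 0, 17, 0, 0)
specific = Rule((3, 1, 314, 0, 17, 0, 0), 1285)
general = Rule((WILDCARD, 1, WILDCARD, WILDCARD, 17, WILDCARD, WILDCARD), 1285)
other = Rule((WILDCARD, 0, WILDCARD, WILDCARD, 17, WILDCARD, WILDCARD), 42)  # made up

print(suggest([specific, general, other], case))
```

The rule `general` corresponds to the textual rule "if (gender == female) and (location == ear) suggest nickelsulfate": all attributes other than gender and location are wildcards.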

In the main loop of the LCS the rules get evaluated but not changed. The 
only way to replace rules from the rule set with new ones is to use the genetic 
algorithm (GA, see e.g. Mitchell (1996)) within the LCS. It generates new 
rules by crossing over existing ones, which simply means constructing rules 
which contain parts of the attributes of other rules. The probability of a rule 
being used in reproduction is proportional to its evaluation. So rules with a 
high evaluation will be used for the generation of new rules more often. 
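
The reproduction step can be sketched in Python as follows. This is only an illustration: the text does not specify the crossover operator, so one-point crossover is assumed here, rules are represented as flat tuples of attribute codes followed by the substance code, and the function names, the strengths and the third rule are made up.

```python
import random


def select_parent(rules, strengths, rng):
    """Pick one rule with probability proportional to its evaluation."""
    return rng.choices(rules, weights=strengths, k=1)[0]


def crossover(first, second, rng):
    """One-point crossover: head of the first rule, tail of the second."""
    point = rng.randrange(1, len(first))
    return first[:point] + second[point:]


rng = random.Random(0)
rules = [
    (3, 1, 314, 0, 17, 0, 0, 1285),
    (None, 1, None, None, 17, None, None, 1285),   # None marks a wildcard
    (None, 0, None, None, 4, None, None, 42),      # made-up rule
]
strengths = [5.0, 3.0, 0.0]   # made-up evaluations; the third rule is never chosen

child = crossover(select_parent(rules, strengths, rng),
                  select_parent(rules, strengths, rng), rng)
print(child)
```

Each position of the child carries the code of one of its two parents, so the new rule contains parts of the attributes of existing rules, as described above.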



3.1 Stimulus-Response LCS

The type of LCS used for the rule generation in our case is called stimulus- 
response LCS, because it immediately responds to a stimulus given to it. In 
this case a stimulus is a presentation of one case from the database. Certain 
rules’ attributes will match this case and these will suggest some substances 
as a response to this stimulus. 

After setting up an initial set of rules, usually generated at random, the
system keeps repeating a performance phase followed by a reinforcement phase;
at certain intervals these are interrupted by a discovery phase. The
algorithm can be stopped after meeting a given goal or after a given number
of iterations.

In the performance phase a single case from the database is chosen at random
and presented to the rule set. The LCS now determines the rules with
attributes matching this case. This subset is referred to as the action set.

The rules in this action set are evaluated during the reinforcement phase
by modifying each rule’s strength. Rules suggesting substances to which the
patient described in the case has shown an allergic reaction are rewarded,
while rules giving wrong suggestions are punished. This is done by a credit
assignment component, as described in detail in Wilson (1987). The
modification of the evaluations is the only action initiated in this phase,
although LCS in general allow changes to the environment, which were not
used here.
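One pass through the performance and reinforcement phases might look like the sketch below. The fixed `reward` and `penalty` values are placeholders of this example; the actual credit assignment follows Wilson (1987) and is not reproduced here. Rules are represented as dictionaries, and `None` again stands for a wildcard.

```python
import random


def matches(condition, attributes):
    """A None entry in the condition acts as a wildcard."""
    return all(r is None or r == a for r, a in zip(condition, attributes))


def performance_and_reinforcement(rules, cases, rng, reward=1.0, penalty=1.0):
    """Pick one case, build the action set, update strengths in place."""
    attributes, allergens = rng.choice(cases)          # performance phase
    action_set = [rule for rule in rules
                  if matches(rule["condition"], attributes)]
    for rule in action_set:                            # reinforcement phase
        if rule["substance"] in allergens:
            rule["strength"] += reward                 # correct suggestion
        else:
            rule["strength"] -= penalty                # wrong suggestion
    return action_set


# Hypothetical two-attribute rules and a single case allergic to substance 1285.
rules = [
    {"condition": (None, 1), "substance": 1285, "strength": 1.0},
    {"condition": (3, None), "substance": 77, "strength": 1.0},
    {"condition": (5, 0), "substance": 1285, "strength": 1.0},
]
cases = [((3, 1), {1285})]
action_set = performance_and_reinforcement(rules, cases, random.Random(0))
```

Only the first two rules match the case; the first is rewarded, the second punished, and the third keeps its strength because it is not part of the action set.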

Up to now the algorithm is only able to react to the environment and change
the rules’ strengths, but not to modify any rules. This task is performed
by a genetic algorithm in the discovery phase. It uses the rules as
individuals and applies the genetic operators recombination and mutation to
them. In recombination new rules are generated by combining two or more rules
from the current rule set, while mutation simply changes one attribute of a
rule to form a new one. Removing rules with a low evaluation leads to a
new, hopefully improved set of rules. In terms of a classical GA, the rules’
strengths are considered their fitness. However, in contrast to a GA, the
LCS does not look for one optimal individual but for an optimal population.
Within this framework some tailoring is required to achieve good performance
on this particular problem. The modifications made to improve the rule
generation from allergologic test data are described next.



3.2 Problem-Specific Modifications

Allergologic diagnosis differs from most other classification problems
in more than one way. The usual approach is to classify by allergens.
This means that a classification has to be made for each known substance.
Thus, single cases can be members of a nearly arbitrary number of classes,
including no class at all for patients who turn out to have no allergies.
Some modifications to the LCS were necessary to be able to use it for this
special task.



[Diagram; labels: Database, Rule Set #1; Database, Rule Set #2; Database, Rule Set #3]

Figure 3: Rule generation by iterated classification

The standard LCS was only able to generate rules for the most prominent
classes within the data sets. These rules were found quite fast. We therefore
decided on an iterative approach in which the cases covered by already
determined rules are no longer used to induce new rules (see figure 3). These
rules were stored separately and replaced by rules generated by generalizing
randomly chosen cases from the database.

So the LCS will first find rules for frequently encountered substances, then
turn to rare ones and finally produce rules induced by only a few cases. This
also makes it possible to scale the run time of the classification to the
user’s individual demands. If only the most general rules are needed they can
be found fast, while the detection of more specific rules for rarely found
allergens will take more time.
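The iteration can be sketched as a loop that removes covered cases before the next run. In this example `generate_rules` stands in for one complete LCS run and is replaced by a toy function that simply proposes the most frequent substance; both helper functions and the default of six iterations are assumptions made for illustration.

```python
from collections import Counter


def iterate_rule_generation(cases, generate_rules, covers, max_iterations=6):
    """Repeatedly induce rules from the cases not yet covered."""
    remaining = list(cases)
    rule_sets = []
    for _ in range(max_iterations):
        if not remaining:
            break
        new_rules = generate_rules(remaining)   # stands in for one LCS run
        if not new_rules:
            break
        rule_sets.append(new_rules)             # store this rule set separately
        remaining = [case for case in remaining
                     if not any(covers(rule, case) for rule in new_rules)]
    return rule_sets, remaining


def most_frequent_substance(cases):
    """Toy rule generator: suggest the substance seen most often."""
    counts = Counter(s for allergens in cases for s in allergens)
    return [counts.most_common(1)[0][0]] if counts else []


def covers(rule, allergens):
    """Toy coverage test: the suggested substance is among the allergens."""
    return rule in allergens


# Each toy case is reduced to the list of substances the patient reacted to.
cases = [["nickel"], ["nickel"], ["nickel", "latex"], ["latex"], ["cobalt"]]
rule_sets, uncovered = iterate_rule_generation(cases, most_frequent_substance, covers)
```

In this toy run the frequent substance is found first and the rarer ones in later iterations, which mirrors the behaviour described above.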

As another modification a postprocessing phase was added to the LCS to
remove duplicated and useless rules from the set of rules. A certain amount
of badly performing rules will be present due to the characteristics of
genetic algorithms, which produce not only better rules by recombination but
also some bad ones. These rules are useless for the final set of rules and
can easily be detected by their low evaluation.
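A possible form of this postprocessing step is sketched below. The threshold `min_strength` and the decision to keep the strongest copy of a duplicated rule are assumptions of this example; the text only states that duplicates and rules with a low evaluation are removed.

```python
def postprocess(rules, min_strength):
    """Drop duplicates (keeping the strongest copy) and weak rules."""
    best = {}
    for rule in rules:
        key = (rule["condition"], rule["substance"])
        if key not in best or rule["strength"] > best[key]["strength"]:
            best[key] = rule
    return [rule for rule in best.values() if rule["strength"] >= min_strength]


# Hypothetical rule set with one duplicate and one badly performing rule.
rules = [
    {"condition": (3, 1), "substance": 1285, "strength": 4.0},
    {"condition": (3, 1), "substance": 1285, "strength": 6.0},
    {"condition": (None, 0), "substance": 77, "strength": 0.5},
    {"condition": (5, None), "substance": 77, "strength": 2.0},
]
kept = postprocess(rules, min_strength=1.0)
```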



4 Interactive Learning Classifier Systems 

The idea of using external sources of information to improve the results of
the rule generation was one of the main reasons to build the system on an
LCS base. The underlying genetic algorithms can easily be extended to an
interactive algorithm. This enables the import of already well known rules
from special databases and the suggestion of rules an allergologist considers
to be of interest.





[Diagram; labels: Genetic Algorithm, Set of Rules, Interactive Genetic Algorithm, Suggestions, Environment, User/Expert]

Figure 4: Integrating an interactive GA into the LCS

In Albert, Schoof (1997) it has been shown how genetic algorithms can be
tailored to accept suggestions from a user. It was demonstrated that good
suggestions improve the optimization, while poor ones result in no changes
at all. Enhancing the LCS with these techniques enables the allergologist to
participate in the generation of rules and to guide the program. However,
the algorithm does not require any suggestions to work properly.

The benefit of human suggestions varies between problems. They are
especially useful when human abilities like intuition or experience are
needed. Problems in mathematical and technical environments usually do not
require these, but in the medical area they become important. Since poor
suggestions cause no harm, providing the possibility to make suggestions can
only have positive effects.

The advantages of this approach are obvious: The expert can easily check 
his assumptions by suggesting them to the LCS and will learn from the 
results. So the LCS is no longer an automatic optimization tool but an 
interactive consulting system and can take any position in between these two, 
depending on the number of suggestions made by the expert and accepted 
by the system. 



5 Experimental Results 




[Plot; horizontal axis: number of LCS iterations]

Figure 5: Data coverage by rule sets found after iterated classification

First tests showed that the interactive iterated LCS is able to find all
relevant rules. Typically, after six iterations more than 90% of the patients
registered in the databases are covered by at least one rule. The coverage
for substances is close to 30% at the same time, which means that rules could
be established for 30% of all substances (see figure 5). An interpretation of
this result has to take into account that for many substances allergies were
found only for single patients or for no patients at all. Therefore a higher
coverage for the substances could not be expected.

The iteration of classification makes it possible to find rules even for
substances tested rarely. The fact that no rule was established twice shows
that the exclusion of cases from the induction of new rules really works.
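The two coverage figures can be computed as in the following sketch. It assumes that a patient counts as covered when at least one rule condition matches the coded case, and that a substance counts as covered when at least one rule suggests it; the function names and the toy data are hypothetical.

```python
def matches(condition, attributes):
    """A None entry in the condition acts as a wildcard."""
    return all(r is None or r == a for r, a in zip(condition, attributes))


def patient_coverage(rules, cases):
    """Fraction of cases matched by at least one rule condition."""
    covered = sum(1 for attributes in cases
                  if any(matches(condition, attributes) for condition, _ in rules))
    return covered / len(cases)


def substance_coverage(rules, substances):
    """Fraction of substances for which at least one rule exists."""
    with_rule = {substance for _, substance in rules}
    return len(with_rule & set(substances)) / len(substances)


# Toy data: rules are (condition, substance) pairs with two coded attributes.
rules = [((None, 1), "nickel"), ((3, None), "latex")]
cases = [(3, 1), (2, 1), (3, 0), (2, 0)]
substances = ["nickel", "latex", "cobalt", "chromate"]

patients = patient_coverage(rules, cases)
covered_substances = substance_coverage(rules, substances)
```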

We are already able to predict 80% of the patients’ allergens correctly. This
is close to the percentage allergologists usually name when asked which
fraction of allergies can be expected to be determined from a patient’s
history alone. The remaining 20% of allergies are considered not to be
related to any characteristics of a patient but to be acquired by chance.



6 Summary and Perspectives 

The analysis of allergologic test data can help to improve the methods used
in future tests and even makes individual test plans possible. The Interactive
Learning Classifier Systems presented have proven to be well suited for
classifying these data and for generating rules from them. Their ability to
accept guidance by an expert promises further improvements. Careful fine
tuning of the system has recently resulted in very good suggestions for test
plans. We hope to be able to decrease even further the number of substances
a patient has to be exposed to.



Abstract: In medicine, there are often experiments characterized by repeated
measurements on the same object. A well-known application is dose-finding
studies, where we try to establish a treatment effect at different doses.
When setting up clinical trials in a cross-over design, we have to test for
both a dose and a group effect of the treatments as well as for possible
interactions between them. In principle, multivariate repeated measures
analysis of variance (RM-ANOVA) can be used for the analysis of such a data
structure (N.H. Timm (1980)). Difficulties arise if there are only a few
observations available. In this case a parametric analysis of variance cannot
be applied anymore and we have to look for alternatives. In a first attempt
and for reasons of easy practicability we applied the method of data
alignment (Hildebrand (1980)), which consists of an adjustment for those
factors not regarded in the current analysis followed by a ranked analysis
of variance.



1 Introductory Examples from Medicine 

Endothelial cells modulate the myocardial contractility under certain
physiological conditions. Experiments on two isolated rat hearts were to
clarify whether the cardiodepressive mediators released after ischemia could
have their origin in the coronary endothelium. The effect of reoxygenated
coronary effluent of heart I on the myocardial contractility of heart II was
investigated. The two hearts were perfused sequentially with (groups I - III)
or without (controls) preceding ischemia of heart I. In groups II and III,
the coronary endothelium of heart I was functionally removed. Figure 1 shows
the change of the left-ventricular contractility (measured by the
differential quotient LVdP/dt) of the second heart.

Figure 1: Profiles in the groups (LVdP/dt)

[Boxplot panels: ME 1, ME 2, new, ME 3]

Figure 2: Comparison of the four medications (DP)

In the therapy of heart failure the negative inotropic effect of a β-blocker
may deteriorate the haemodynamics. On the other hand, β₂-agonists have
haemodynamic advantages in the treatment of heart failure. Therefore,
a combination consisting of a cardio-selective β₁-blocker and a
β₂-sympathomimetic should combine the advantages of both. In a clinical trial
using a cross-over design three different medications were compared with a
new β₁-blocker with β₂-agonistic effect. Figure 2 shows boxplots of the four
medications (in 4 doses each) for the parameter double product DP (DP:
systolic arterial pressure × heart frequency) indicating the haemodynamics.

2 The Multivariate Linear Model Approach 

Let $n$ $q$-dimensional vectors of observations $y_i = (y_{i1}, \ldots, y_{iq})$
be independently sampled from $n$ $q$-dimensional normally distributed
populations. In the context of "repeated measurements" we take the
components of $y_i$ as repeated measurements.

We denote by $\mu_i$ the corresponding $n$ expectations and by

$$\Sigma_i = \mathrm{Cov}[y_i, y_i] \quad \text{with} \quad \Sigma_i = \Sigma \;\; \forall\, i = 1, \ldots, n \tag{1}$$

the covariance matrix. The expectations $E[y_i] = \mu_i$ may fulfil a linear model

$$Y = XB + E, \tag{2}$$

with $X$ the design matrix, $B$ the matrix of parameters and $E$ the matrix of
errors $\varepsilon$ (correlated within the objects, i.e. $\Sigma \neq \sigma^2 I$,
$I$ identity matrix).

The well-known test statistics and corresponding $(1 - \alpha)$-confidence
intervals follow for that model (cf. N.H. Timm (1980)).



2.1 One-sample Design for Repeated Measurements 

Let $n$ observations be sampled from the same population. The $q$ repeated
observations on every object (patient) may be measured on the same scale
(Table 1 shows the design).

object (patient)   observation 1   observation 2   ...   observation q
patient 1          y_{1,1}         y_{1,2}         ...   y_{1,q}
...
patient n          y_{n,1}         y_{n,2}         ...   y_{n,q}

Table 1: One-sample design with q repeated measurements

That corresponds to the model

$$E[y_i] = \mu \; ; \quad y_i = \mu + \varepsilon_i \; ; \quad i = 1, \ldots, n \tag{3}$$

and we test the null hypothesis

$$H_{01}: \mu_1 = \mu_2 = \cdots = \mu_q, \tag{4}$$

i.e. the repeated measures do not differ from each other (no dose effect).

2.2 K-sample Design for Repeated Measurements 

Now we suppose a data structure such as given in Table 2. Altogether $K$
independent groups with $n_k$ objects (patients) ($k = 1, \ldots, K$) each were
investigated and in every group $q$ repeated measures (doses) were used. To
simplify matters we used different symbols for the groups. Typically, if the
time between treatments in a cross-over (wash-out) is much larger than
the time between repeated measures within a treatment, we may suppose
independence between the treatment groups in this case as well (cf. Jones
and Kenward (1990)).

                            dose 1        dose 2        ...   dose q
therapy 1   patient 1       x_{1,1}       x_{1,2}       ...   x_{1,q}
            ...
            patient n_1     x_{n_1,1}     x_{n_1,2}     ...   x_{n_1,q}
...
therapy K   patient 1       z_{1,1}       z_{1,2}       ...   z_{1,q}
            ...
            patient n_K     z_{n_K,1}     z_{n_K,2}     ...   z_{n_K,q}

Table 2: K-sample design for repeated measurements

That leads to the model

$$E[y_{kj}] = \mu_k \; ; \quad y_{kj} = \mu_k + \varepsilon_{kj} \; ; \quad k = 1, \ldots, K \; ; \quad j = 1, \ldots, n_k \tag{5}$$

and we test the null hypotheses

$$H_{02}: \begin{bmatrix} \mu_{11} \\ \vdots \\ \mu_{1q} \end{bmatrix} = \cdots = \begin{bmatrix} \mu_{K1} \\ \vdots \\ \mu_{Kq} \end{bmatrix} \; ; \quad \mu_1 = \mu_2 = \cdots = \mu_K \tag{6}$$

$$H_{03}: \begin{bmatrix} \mu_{11} \\ \vdots \\ \mu_{K1} \end{bmatrix} = \begin{bmatrix} \mu_{12} \\ \vdots \\ \mu_{K2} \end{bmatrix} = \cdots = \begin{bmatrix} \mu_{1q} \\ \vdots \\ \mu_{Kq} \end{bmatrix} \tag{7}$$

$$H_{04}: \begin{bmatrix} \mu_{11} - \mu_{12} \\ \vdots \\ \mu_{1(q-1)} - \mu_{1q} \end{bmatrix} = \cdots = \begin{bmatrix} \mu_{K1} - \mu_{K2} \\ \vdots \\ \mu_{K(q-1)} - \mu_{Kq} \end{bmatrix} \tag{8}$$

which correspond to the questions

$H_{02}$: Are there differences between the $K$ groups (therapies)?

$H_{03}$: Do the repetitions (doses) differ from each other?

$H_{04}$: Are there interactions, i.e. do certain doses exist where possible
differences between the therapies are particularly pronounced?



2.2.1 Results of the Cross-over Study on Heart Failure 

From simulation studies (K.-D. Wernecke (1993)) it follows that a multivariate
analysis of variance can be applied if $n_k/q > 3$ ($n_k$: sample sizes of the
groups) holds.

With $q = 4$, $n_1 = 12$, $n_2 = 15$, $n_3 = 13$, $n_4 = 16$ we got differences
between therapy groups: $p = 0.000$, dose effects: $p = 0.046$, interactions
therapy-dose: $p = 0.000$, i.e. significant differences between therapies,
weak dose effects and significant interactions.

2.3 Univariate Specifications of the Multivariate Model

In the multivariate linear model approach there are no preconditions on the
structure of the covariance matrix $\Sigma$ (only $\Sigma_i = \Sigma$). Now we
define a univariate model

$$E[y_{ih}] = \mu_{ih} \; ; \quad y_{ih} = \mu_{ih} + \varepsilon_{ih} \; ; \quad i = 1, \ldots, n \; ; \quad h = 1, \ldots, q \tag{9}$$

with preconditions being specified with regard to certain hypotheses. In
contrast to the well-known model of univariate analysis of variance the errors
$\varepsilon_{ih}$ are now correlated with respect to the rows of $Y$, i.e.
$\Sigma \neq \sigma^2 I$.

Notably, according to A. Azzalini (1995) we cannot rely on some "robustness"
of the F-statistic used in the analysis. Given errors

$$\varepsilon_{ih} = \eta_{ih} + \theta \, \eta_{i,h-1} \tag{10}$$

where $\eta_{ih}$ are iid $N(0, \sigma^2)$ and $\theta$ is a fixed parameter with
$|\theta| < 1$, then

$$\mathrm{Corr}(\varepsilon_{ih}, \varepsilon_{il}) = \begin{cases} 1 & \text{if } h = l \\ \rho & \text{if } |h - l| = 1 \\ 0 & \text{if } |h - l| > 1 \end{cases} \tag{11}$$

holds ($\rho = \theta/(1 + \theta^2)$). Table 3 shows the results (exact values)
of the F-statistics in testing the row and column effect, respectively.

ρ                          -0.4     -0.2     0        0.2      0.4
exact probabilities of the F-statistics (%)
test of row effects        0.03     1.01     5.00     13.05    24.70
test of column effects     5.90     5.27     5.00     5.37     6.68

Table 3: Results of the F-test statistics in dependence on the correlation ρ
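The relation between the parameter θ of the error model and the correlation ρ can be checked numerically. The sketch below evaluates the correlation for a given lag and inverts ρ = θ/(1 + θ²) for the solution with |θ| < 1; the function names are ours, and the closed-form inverse is elementary algebra added for illustration rather than a formula taken from Azzalini (1995).

```python
import math


def ma1_correlation(theta, lag):
    """Correlation between two errors that are `lag` repetitions apart."""
    if lag == 0:
        return 1.0
    if abs(lag) == 1:
        return theta / (1.0 + theta ** 2)
    return 0.0


def theta_for_rho(rho):
    """Solution with |theta| < 1 of rho = theta / (1 + theta^2)."""
    if rho == 0:
        return 0.0
    if abs(rho) >= 0.5:
        raise ValueError("|rho| must be below 0.5 for |theta| < 1")
    return (1.0 - math.sqrt(1.0 - 4.0 * rho ** 2)) / (2.0 * rho)


# Parameters belonging to the correlations listed in Table 3.
thetas = {rho: theta_for_rho(rho) for rho in (-0.4, -0.2, 0.0, 0.2, 0.4)}
```

For example, ρ = 0.4 corresponds to θ = 0.5. Since |ρ| stays below 0.5 for |θ| < 1, the values in Table 3 cover most of the admissible range.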

A special approach is the so-called split-plot design. It supposes a covariance
structure

$$\Sigma = \sigma^2 I + \sigma_b^2 \mathbf{1}_{q \times q} = \begin{bmatrix} \sigma^2 & & 0 \\ & \ddots & \\ 0 & & \sigma^2 \end{bmatrix} + \begin{bmatrix} \sigma_b^2 & \cdots & \sigma_b^2 \\ \vdots & & \vdots \\ \sigma_b^2 & \cdots & \sigma_b^2 \end{bmatrix} \tag{12}$$

($\mathbf{1}_{q \times q}$: matrix with all elements = 1). That means, the
correlations between any two repetitions of an object (patient) are always the
same (so-called compound symmetry). In medical practice correlations will
normally decrease with increasing time lags between the repetitions, i.e.
compound symmetry is not fulfilled.
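A small numerical sketch shows what compound symmetry means. It builds the covariance matrix of the split-plot design for arbitrarily chosen variances and converts it into correlations; the function names and the numbers are illustrative assumptions.

```python
import math


def compound_symmetry(q, sigma2, sigma_b2):
    """Sigma = sigma2 * I + sigma_b2 * (q x q matrix of ones)."""
    return [[sigma_b2 + (sigma2 if r == c else 0.0) for c in range(q)]
            for r in range(q)]


def correlations(cov):
    """Convert a covariance matrix into a correlation matrix."""
    sd = [math.sqrt(cov[i][i]) for i in range(len(cov))]
    return [[cov[r][c] / (sd[r] * sd[c]) for c in range(len(cov))]
            for r in range(len(cov))]


sigma = compound_symmetry(q=3, sigma2=1.0, sigma_b2=3.0)
corr = correlations(sigma)
```

All off-diagonal correlations equal σ_b²/(σ² + σ_b²), here 0.75, regardless of how far apart the repetitions are. This is exactly the property that is doubtful for measurements taken at increasing time lags.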



3 Nonparametric RM-ANOVA 

The generalisation of the Kruskal-Wallis test to more than one factor is only
correct if there is no further effect besides the effect under test (Erdfelder
and Bredenkamp (1984)). But the confounding of the effects in a multi-factor
analysis of variance can be avoided by introducing the principle of
"ranking-after-alignment".

3.1 Analysis of Variance with Ranking-after-alignment 

Hildebrand (1980) proposed a ranking procedure by transforming the data in
such a manner that the interesting influences, the main effects A (groups) and
B (repetitions) as well as the interactions A × B, can be tested by one-factor
analyses of variance without repeated measures (Kruskal-Wallis test). We
start with a scheme according to Table 4.

A        B: dose 1       ...   dose q          A_kj                              A_k
gr. 1    y_{1,1,1}       ...   y_{1,1,q}       A_{1,1} = Σ_h y_{1,1,h}           A_1
         ...                   ...             ...
         y_{1,n_1,1}     ...   y_{1,n_1,q}     A_{1,n_1} = Σ_h y_{1,n_1,h}
...
gr. K    y_{K,1,1}       ...   y_{K,1,q}       A_{K,1} = Σ_h y_{K,1,h}           A_K
         ...                   ...             ...
         y_{K,n_K,1}     ...   y_{K,n_K,q}     A_{K,n_K} = Σ_h y_{K,n_K,h}
B_h      B_1             ...   B_q = Σ_{k,j} y_{k,j,q}

Table 4: Design with 2 factors A, B and K groups in q doses each

In the first step we analyse factor A (averaged over the repetitions), i.e. we
use the marginal averages $A_{kj} = \frac{1}{q} \sum_h y_{kjh}$ as new variables
and analyse the resulting Table 5 with the Kruskal-Wallis test statistic.

group 1       group 2       ...   group K
A_{1,1}       A_{2,1}       ...   A_{K,1}
...
A_{1,n_1}     A_{2,n_2}     ...   A_{K,n_K}

Table 5: Ranked analysis of variance with K groups from A_kj

The test of factor B (repetitions) can be obtained via the transformation

$$y'_{kjh} = y_{kjh} - A_{kj} - AB_{kh} + B_h \tag{13}$$

with $AB_{kh} = \frac{1}{n_k} \sum_{j=1}^{n_k} y_{kjh}$ as averages over every
cell. The resulting scheme according to Table 6 can again be analysed by the
Kruskal-Wallis test statistic.

group 1          group 2          ...   group q
y'_{1,1,1}       y'_{1,1,2}       ...   y'_{1,1,q}
...
y'_{1,n_1,1}     y'_{1,n_1,2}     ...   y'_{1,n_1,q}
...
y'_{K,1,1}       y'_{K,1,2}       ...   y'_{K,1,q}
...
y'_{K,n_K,1}     y'_{K,n_K,2}     ...   y'_{K,n_K,q}

Table 6: One-factor analysis of variance with q groups from the y'_kjh

In the last step we align the starting values with the group effect (A) and
the effect of factor B (repetitions) in order to test for interactions:

$$y''_{kjh} = y_{kjh} - A_{kj} - B_h + 2G \tag{14}$$

with $G = \frac{1}{n} \sum_{k,j,h} y_{kjh}$ as total average
($n = q \cdot \sum_k n_k$) in the sense of a normalisation (not influencing the
ranking). That results in a scheme according to Table 7, which is analysed
like the others before.

group 11          ...   group 1q          ...   group K1          ...   group Kq
y''_{1,1,1}       ...   y''_{1,1,q}       ...   y''_{K,1,1}       ...   y''_{K,1,q}
...
y''_{1,n_1,1}     ...   y''_{1,n_1,q}     ...   y''_{K,n_K,1}     ...   y''_{K,n_K,q}

Table 7: One-factor analysis of variance with K x q groups from the y''_kjh
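The first and the last step of the procedure can be sketched in a few lines. The example below is an illustration under stated assumptions, not the implementation used for the results reported here: the data are stored as `data[k][j][h]` (group, patient, dose), the dose effects `B_h` are taken as averages over all patients, and only the Kruskal-Wallis statistic H is computed (it would be compared with a chi-square distribution with one degree of freedom less than the number of groups). The alignment for factor B according to Eq. (13) follows the same pattern and is omitted.

```python
from itertools import chain


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with average ranks and tie correction."""
    values = sorted(chain.from_iterable(groups))
    n = len(values)
    if n < 2:
        return 0.0
    rank_of, tie_term, i = {}, 0.0, 0
    while i < n:
        j = i
        while j < n and values[j] == values[i]:
            j += 1
        rank_of[values[i]] = (i + 1 + j) / 2.0   # average of ranks i+1 .. j
        tie_term += (j - i) ** 3 - (j - i)
        i = j
    correction = 1.0 - tie_term / (n ** 3 - n)
    if correction == 0.0:                        # all observations identical
        return 0.0
    rank_sums = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    return (12.0 / (n * (n + 1)) * rank_sums - 3.0 * (n + 1)) / correction


def subject_means(data):
    """A_kj: average over the q repetitions of patient j in group k."""
    return [[sum(subject) / len(subject) for subject in group] for group in data]


def align_for_interaction(data):
    """Values y_kjh - A_kj - B_h + 2G, collected per cell (k, h)."""
    q = len(data[0][0])
    subjects = [s for group in data for s in group]
    b = [sum(s[h] for s in subjects) / len(subjects) for h in range(q)]
    g = sum(b) / q                               # total average
    cells = {}
    for k, group in enumerate(data):
        for subject in group:
            a_kj = sum(subject) / q
            for h, y in enumerate(subject):
                cells.setdefault((k, h), []).append(y - a_kj - b[h] + 2 * g)
    return cells


# Toy data: 2 groups, 2 patients each, 2 doses, purely additive effects.
data = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
h_groups = kruskal_wallis_h(subject_means(data))
cells = align_for_interaction(data)
h_interaction = kruskal_wallis_h(list(cells.values()))
```

Because the toy data contain no interaction, all aligned values coincide with the total average and the statistic for the interaction is zero, while the group comparison on the patient averages yields H = 2.4.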

3.1.1 Results of the Given Applications 

The results from the proposed nonparametric analysis of variance for the
cross-over study on heart failure are consistent with those from the
multivariate computations. Groups: $p = 0.009$; doses: $p = 0.000$;
interactions: $p = 0.000$. The nonparametric variant was extended by pairwise
comparisons of the groups, which showed no significant group effects only for
groups 1 vs. 2 and 2 vs. 3, thus reflecting the underlying data structure.

In the application to the cardiodepressive mediator data we are confronted
with sample sizes of $n_1 = 9$ (controls), $n_2 = 7$ (group I) and
$n_3 = n_4 = 6$ (groups II and III) for $q = 9$. With such sample sizes a
parametric analysis of variance has to be ruled out. Applying the procedure
introduced before we obtained the results: group effects: $p = 0.0022$; time
effects: $p = 0.0000$; interactions: $p = 0.0000$. Besides significant group
and time effects, relevant interactions can be stated, i.e. particular group
differences in characteristic time periods.

3.2 Post-hoc Analyses 

If differences between a control and $K - 1 \geq 2$ other treatments are to be
tested, we always have to deal with a multiple test problem.

3.2.1 Results of the Given Applications 

In the cross-over study on heart failure as a dose-response study we suppose
increasing effects with increasing dose levels. Then the closed test procedure
with Bonferroni correction proposed by Budde and Bauer (1989) can be applied.
Table 8 shows the pairwise results with respect to the new drug and
another β-blocker (medic. 1).

             new drug (n = 13)                 medic. 1 (n = 12)
             1 vs 2     2 vs 3     3 vs 4      1 vs 2     2 vs 3     3 vs 4
p-values     0.0373     0.4034     0.0079      0.0131     0.0498     0.0023

Table 8: Results of the pairwise testing (new drug and medic. 1)

With a multiple test level of α = 0.05 we got:

New drug: Effects in (3,4), therefore in (1,2,3,4) and (2,3,4).

Medic. 1: Effects in (3,4), therefore in (1,2,3,4) and (2,3,4); effects in
(1,2), therefore in (1,2,3); effects in (2,3).

Because of an effect in time no monotonicity can be supposed in the mediator
data. Therefore, we applied the sequentially rejective Bonferroni-Holm test
procedure (Holm (1979)). The results are shown in Table 9.

effect: between groups        LVdP/dt (p-values)   significance levels
overall testing               0.0004               α = 0.05
multiple comparisons
1 - 3 (1: control)            0.0032               α/6 = 0.0083
1 - 2                         0.0036               α/5 = 0.010
1 - 4                         0.0095               α/4 = 0.0125
2 - 3 (2: with endoth.)       0.1531               α/3 = 0.0167
3 - 4 (3: without end.)       0.3367               α/2 = 0.025
2 - 4 (4: without end.)       0.6682               α = 0.05

Table 9: Results from the Bonferroni-Holm procedure (LVdP/dt)

There are significant differences between the controls and all the other
groups simultaneously (multiple level α = 5%) and no significant differences
between groups with or without endothelium, the comparison the study aimed at.
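The decisions in Table 9 can be retraced with a short sketch of the sequentially rejective procedure: the p-values are ordered, the smallest is compared with α/6, the next with α/5 and so on, and testing stops at the first non-significant comparison. The function name `holm` and the dictionary layout are choices made for this example.

```python
def holm(p_values, alpha=0.05):
    """Return {comparison: rejected?} for the Bonferroni-Holm procedure."""
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ordered)
    decisions = {}
    still_rejecting = True
    for i, (label, p) in enumerate(ordered):
        # Once one hypothesis is retained, all remaining ones are retained too.
        still_rejecting = still_rejecting and p <= alpha / (m - i)
        decisions[label] = still_rejecting
    return decisions


# p-values of the multiple comparisons in Table 9.
table9 = {"1-3": 0.0032, "1-2": 0.0036, "1-4": 0.0095,
          "2-3": 0.1531, "3-4": 0.3367, "2-4": 0.6682}
decisions = holm(table9)
```

With these values the three comparisons involving the control group are rejected, and the three comparisons among the other groups are not.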